Thursday, January 17, 2008

Using the iRam: Improving ZFS perceived transaction latency

I've long been overdue in reviewing the Gigabyte iRam card and its effect on the performance of your favorite ZFS NAS product. NexentaStor already supports log devices, so the time appeared right to get one of these for a client I consult with, to help deal with the noticeable pauses one can see when heavy reads and writes compete on a ZFS pool. I hope that the single-threaded nature of those commits is resolved at some future point, but the iRam card appears to be a simple way to inject an NVRAM-like device into your commodity NAS solution.

The card itself is simply four DIMM sockets for DDR RAM, with a battery backup, a reset switch, power drawn from the PCI bus, and a single SATA-I connection to plug the unit into your existing SATA interfaces. Already you can see that the performance limit is 150MB/sec based on the SATA-I spec. What does this card do, though? Near-instant reads and writes in a safe battery-backed ramdisk that your system sees as a 2GB or 4GB drive, just what you'd want for a dedicated write commit device. With many spindles in an array, you can likely do better than this device for raw throughput, but in the case of many small commits, the near-perfect latency of RAM is much better suited to keeping writes flowing without stalling the drives for reads. Since it's a "slog" device in ZFS terms, it will regularly commit to the real underlying storage at full disk bandwidth. Therefore, even when writes must compete with reads on the physical disks, you limit your exposure to perceived stalls in I/O requests, even in the higher-load cases.
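For reference, the 150MB/sec ceiling falls out of the SATA-I signaling rate. This is a minimal sketch of that arithmetic, assuming the standard 1.5 Gbit/s line rate and 8b/10b encoding; the function and constant names are made up for illustration.

```python
# SATA-I signals at 1.5 Gbit/s and uses 8b/10b encoding, so only
# 8 of every 10 bits on the wire carry payload data.
SATA1_LINE_RATE_BITS_PER_SEC = 1_500_000_000


def sata_payload_mb_per_sec(line_rate_bits=SATA1_LINE_RATE_BITS_PER_SEC):
    """Return the payload ceiling in MB/s (decimal megabytes)."""
    payload_bits = line_rate_bits * 8 // 10   # strip 8b/10b overhead
    payload_bytes = payload_bits / 8          # bits -> bytes
    return payload_bytes / 1_000_000          # bytes -> MB


print(sata_payload_mb_per_sec())  # 150.0
```

Protocol overhead means real transfers land below this figure, so it is an upper bound rather than an expected rate.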

For my non-production test, I actually put together the worst-case scenario: an iSCSI-backed ZFS array with NFS clients and many small files. In this case, each NFS file write requires 3 fsyncs on the back-end storage (create, modify, close). This is actually similar to CAD libraries, which the test was made to reflect. Using iSCSI devices, you can inflict much higher latencies. My iSCSI targets are actually older SATA-I drives themselves on an SBEi Linux-based target using 3ware 8500s. Again, nowhere near ideal.

I created a directory of 5000 small 8k files and copied it from a Linux gig-e connected client to a ZFS pool (made of two non-striped iSCSI LUNs), and got a meager 200K/sec write performance over NFS. When I striped the data in the ZFS pool instead, the numbers increased to 600K/sec at some points. After adding a 2GB Gigabyte iRam drive, those numbers went up to as much as 9MB/sec, averaging around 5MB/sec overall. Measured against the 200K/sec baseline, that's well over 10 times the performance. Again, this test involves many I/O operations rather than using much bandwidth.
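To see why these numbers are about latency rather than bandwidth, the throughput figures can be turned into an implied time per synchronous commit. This is only a rough sketch under stated assumptions: "8k" is taken as 8 KiB, "K/sec" and "MB/sec" as KiB/s and MiB/s, each file costs the 3 commits mentioned earlier, and commits are processed strictly one after another. The names are hypothetical and nothing here was measured directly.

```python
# Assumptions (not measured): 8 KiB files, binary units for the rates,
# 3 synchronous commits per file, commits handled one at a time.
FILE_BYTES = 8 * 1024
SYNCS_PER_FILE = 3  # create, modify, close


def implied_commit_latency_ms(throughput_bytes_per_sec):
    """Estimate milliseconds per sync commit from observed NFS throughput."""
    files_per_sec = throughput_bytes_per_sec / FILE_BYTES
    commits_per_sec = files_per_sec * SYNCS_PER_FILE
    return 1000.0 / commits_per_sec


scenarios = {
    "two iSCSI LUNs, non-striped": 200 * 1024,
    "striped (peak)": 600 * 1024,
    "with iRam slog (average)": 5 * 1024 * 1024,
    "with iRam slog (peak)": 9 * 1024 * 1024,
}

for name, rate in scenarios.items():
    print(f"{name}: {implied_commit_latency_ms(rate):.2f} ms per commit")
```

Under those assumptions the non-striped pool works out to roughly 13 ms per commit, while the iRam average is about half a millisecond, which is the difference you would expect between a round trip to a remote spinning disk and a write to local RAM.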

How fast can data be written to and read from that log device? My tests showed that 100MB/sec for reads and writes was common, with writes only bursting to those numbers for larger streaming data sets. In the case of the iSCSI nodes in question, each one could be pulled at a top rate of 45MB/sec, but averaged closer to 27MB/sec. Nominally, you can see that the log device is about 3x faster than at least these gig-e iSCSI devices.

The final production installation of the iRam device was with a SATA-II DAS array, and even in heavier load scenarios, we saw the waits for write commits to the drives kept in check, and a steady 100+MB/sec use of the commit log (reads and writes). The only caveat for using such a device is that the current builds of OpenSolaris, and thus NexentaStor, do not allow you to remove it once it has been added to a pool. A future release is supposed to address that.


Pentel said...

Have you considered installing 2 iRAMs and using RAID0 for better performance? Four running RAID10 would be even better for a production system, but that's quite a bit of real estate being taken up.

Have you considered small SSD drives? I can't imagine they'd perform nearly as well but they might be more reliable.

Matt (Joe Little) said...

I would consider multiple iRAM cards for multiple volumes, as you can only use one per zpool. Performance isn't critical as much as the latency and deterministic speeds. I looked at small SSD, but most that have any reasonable durability in the number of writes were pretty slow (20MB/sec). I'd want a similar 80MB/sec performance envelope.

I would consider a simple RAID1 of the iRAM, just for reliability's sake in case you lose one card. There is no ability in ZFS currently to remove an existing slog device such as this. Having two is safer.

(Joe Little) said...

A little update. The latest OpenSolaris release (B94) contains a fix that should prevent a rebuild of a slog device from incorrectly being considered a standard pool vdev. Thus, slog devices have become much safer to use. I believe the next release of NexentaStor should contain this fix as well so that I can use it :).

myxiplx said...

b94 still isn't perfect for ZILs, I'm afraid. If your ZIL device dies, don't export your pool whatever you do, or you won't be able to import it again.

Apparently, although ZFS is happy to mount a pool without a ZIL at boot time, it's currently unable to import a pool if the ZIL is missing.

(Joe Little) said...

Hmm. I noticed that behavior before B94, when I lost my storage pool when the vdev failed. So, the problem has simply moved to another part of the system. If you can correctly replace the slog vdev, does the export/import then work?