
Tuesday, March 02, 2010

ZFS Log Devices: A Review of the DDRdrive X1

My previous notes here have covered the trend toward commodity storage, my happiness with most things ZFS and Nexenta, and how someday this will all make for a great primary storage story. At Stanford, we have a lot of disk-to-disk backup storage built on Nexenta solutions, using iSCSI or direct-attached storage. We have also had some primary-tier uses, but have had to play fast and loose with ZFS to get comparable performance. In essence, we sacrificed some of ZFS's assured data integrity to meet end users' expectations of what file servers provide.

A typical workaround was to set these values in /etc/system:
* disable the ZFS Intent Log entirely
set zfs:zil_disable = 1
* skip cache flush commands to the disks
set zfs:zfs_nocacheflush = 1

These settings allowed a ZFS appliance to perform similarly to Linux or other systems when it came to NFS server performance. When you are writing a lot of large files, the ZFS Intent Log's additional latency doesn't affect NFS client performance. However, when those same clients expect their fsyncs to be honored on the back end, and the workload is a mix of file sizes dominated by small writes, performance with the ZIL enabled becomes pathologically poor. In some of my basic synthetic tests I measured around 400KB/sec. With the ZIL disabled, I generally got 3-5MB/sec or so, roughly 10x the performance. That's cheating, though, and not so safe: the client thinks a write is complete even if the back-end server never commits it before a power loss or crash.

One ray of hope previously mentioned on this site was the Gigabyte i-RAM. This battery-backed SATA-I solution held some promise, but at the time I used it I ran into a few difficulties. First, the state of the art at the time did not allow removing a log (ZIL-dedicated) device from a pool; if the log device failed, you had to recreate the pool. That was a problem with the i-RAM: it went offline on me twice, requiring a reset of the device, which essentially blanked it out and forced me to re-initialize it as a drive for ZFS. Second, the connection was SATA-I only, and it did not play well with certain SATA-II chipsets or when mixed with SATA-II devices; many users had to run it in IDE mode rather than the preferred AHCI mode.

Time has passed, and new solutions have presented themselves. First, log devices can now be added to or removed from a pool at any time, on the fly. Also new to the discussion is the DDRdrive X1. This combined RAM and NAND device presents a 4GB drive image with extremely high IOPS, and it saves its contents to stable storage (SLC NAND flash) if the bus loses power. The device itself connects via PCI Express, with drivers for OpenSolaris/Nexenta (among others) that present it as a SCSI device.
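
For reference, attaching a dedicated log device to an existing pool, and taking it back out again, is now a one-liner each way. A minimal sketch, with the pool name and device names being examples only:

# add the X1 (here showing up as c2t0d0) as a dedicated log device to pool "tank"
zpool add tank log c2t0d0

# with current builds, remove it again without recreating the pool
zpool remove tank c2t0d0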

I tried different scenarios with this ZIL device, and all of them show it to be a sweet little device. I pushed mixed files onto the appliance via NFS (Linux client) and found that I could multiply the number of clients and increase performance almost linearly. Where I would hit 450KB/sec without the ZIL device, and not improve much on that rate with additional writers, the dedicated log device immediately gave me a good 7MB/sec, with 4 concurrent write jobs yielding 27MB/sec. During this test, the X1 showed only a 20% busy rate in iostat. At that utilization it would appear I could get up to 135MB/sec (5x more concurrent writers), but my network connection was just gig-e, so getting anywhere near 120+MB/sec would be phenomenal. Another sample of mixed files with 5 concurrent writers pushed the non-X1 config to 1.5MB/sec, while the X1 took my numbers to 45-50MB/sec.

So what is providing all this performance? As I mentioned above, the fsyncs on writes from the NFS client force synchronous transactions in ZFS when the ZIL is not disabled. Without an X1 log device I measured around 120 IOPS (I/O operations per second). With the dedicated RAM/NAND DDRdrive X1, I easily approach 5,000 IOPS. Those commits happen quickly, and the final write to stable storage on your disk array is laid out in the more typical 128K blocks per I/O. This dedicated ZIL device has been shown to do up to 200,000 IOPS in synthetic benchmarks. Let's try the NFS case one more time, in a somewhat more practical test.
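
If you want to watch this yourself, iostat is enough. Something like the following, assuming the X1 shows up as c3t0d0 (an example name), reports per-device reads/sec, writes/sec, and percent busy at five-second intervals:

# -x for extended per-device statistics, -n for descriptive device names
iostat -xn 5 | egrep 'device|c3t0d0'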

Commonly, in simulation, CAD applications, software development, and the like, you will be conversing with the file server through hundreds to thousands of small file writes. To test this, and to make it the worst-case scenario of disk-block-sized files, I created a directory of 1,000 512-byte files on the client's local disk. I did multiple runs to make sure this fit in memory, so that we were measuring file server write performance. I then ran 400 concurrent jobs writing this set to separate target directories on the file server. With the dedicated ZIL device enabled, I got 24MB/sec write rates averaging 6,000 IOPS. I did spike up to 43K IOPS and 35MB/sec, likely when committing some of the metadata associated with all these files and directories. Still, the X1 averaged only 20% busy during this test.
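
The harness was nothing fancy; a rough sketch of the sort of thing I ran from the client follows, with /mnt/nas and the file names being made-up examples:

# build 1,000 512-byte files once on the client's local disk
mkdir -p /var/tmp/smallfiles
for i in $(seq 1 1000); do
  dd if=/dev/urandom of=/var/tmp/smallfiles/f$i bs=512 count=1 2>/dev/null
done

# launch 400 concurrent jobs, each copying the set into its own directory over NFS
for j in $(seq 1 400); do
  ( mkdir -p /mnt/nas/run$j && cp /var/tmp/smallfiles/* /mnt/nas/run$j ) &
done
wait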

Next, I disabled the DDRdrive X1 and tried again, hitting the same old wall. This was the pathological case: with 400 concurrent writes I still got just 120 IOPS and 450KB/sec. My only thought at the time was "sad, very sad".

You can draw your own conclusions from this mostly not-too-scientific test. For me, I now know of an affordable device that avoids the drawbacks (4K block sizes, wear leveling) of SSDs used as ZIL devices. One can now put together a commodity storage solution with this and Nexenta, and get the uncompromised performance one would expect from any first-tier storage platform.

That leads me to the "one more thing" category. I decided to place some ESX NFS storage-pooled volumes on this box and compare it to the performance of the NetApps we use to host our ESX VMs (over NFS). The file access patterns of VMs resemble mixed-size file operations, but the writes tend to be larger, so the ZIL may not have as drastic an effect. Without the X1 I measured 30-40MB/sec of disk performance from operations within a VM (random tests, dd, etc). With the ZIL device enabled, I got 90-120MB/sec, so we still saw a 3x improvement. I couldn't easily isolate all traffic away from my NetApps, but I averaged 65MB/sec on those tests.

Here, I think the conclusion I can draw is this: the dedicated ZIL device again improved performance up to what I can theoretically get from my network path. The comparison one can safely make with a NetApp is not that the X1 box is faster, since my tests ran under different loads, but that it can likely match the line rate of your hardware and remove filesystem and disk array performance from the equation entirely. Perhaps in a 10G network environment, or with some link aggregation, we can start to stress the DDRdrive X1, but for now it's obvious that it enables commodity storage solutions to meet typical NAS performance expectations.


Tuesday, May 27, 2008

The problem with slogs (How I lost everything!)...

A while back, I spoke of the virtues of using a slog device with ZFS. The system I put into production had an Nvidia-based onboard SATA controller and a Gigabyte i-RAM card. No problems there, but at the time my OpenSolaris-based NexentaStor NAS used the cmdk driver (PATA mode). After a while, I got an error where the i-RAM "reset" and the log went degraded. The system simply started using the data disks for the intent log, so no harm done. It's important to note that the kernel was a B70 OpenSolaris build.

Later, I wanted to upgrade to NexentaStor 1.0, which is based on B85. Post-upgrade, or even when using a boot CD, the system would never come up with the i-RAM attached. The newer kernel used the nv_sata driver, and since I could always get things working under B70, I reverted to that. This is one nice feature Nexenta has had for quite some time: the whole OS is checkpointed using ZFS, allowing reversion if an upgrade doesn't take. Well, the NAS doesn't like having a degraded volume, so I set out to "fix" the log device. Currently in ZFS, log devices cannot be removed, only replaced. So I tried to replace it using the "zpool" command. Replacing the failed log with itself always fails because it is "currently in use by another zpool". I figured out a way around that: fully clearing the log with something like "dd if=/dev/zero of=/dev/mylogdevice bs=64k". I was then able to upgrade my system to B85, attempted the replace again, and it looked like it was working:


  pool: data
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 0.00% done, 450151253h54m to go
config:

        NAME           STATE     READ WRITE CKSUM
        data           DEGRADED     0     0     0
          raidz1       ONLINE       0     0     0
            c7t0d0     ONLINE       0     0     0
            c7t1d0     ONLINE       0     0     0
            c6t1d0     ONLINE       0     0     0
        logs           DEGRADED     0     0     0
          replacing    DEGRADED     0     0     0
            c5d0       UNAVAIL      0     0     0  cannot open
            c8t1d0     ONLINE       0     0     0

errors: No known data errors


Note well that it is replacing one log device with another (using the new nv_sata naming). However, after reaching 1% it would always restart the resilver, even with no ZFS activity, no snapshots, etc. The system was busy resilvering and resetting, getting nowhere. I decided to reboot into B70, and as soon as it came up the resilver started immediately and, after quite a long time for a 2GB drive, completed. So everything was now fine, right?

This is where things really went wrong. At the end of the resilver the volume was still considered degraded, and the status looked like the output above but with only one log device. On rebooting the system, the volume started spewing ZFS errors, the checksum counters went flying, and my pool went offline. Another reboot, this time with the log device disconnected (since nv_sata would not tolerate it being connected at boot), produced immediate kernel panics. What the hell was going on? Using the boot CD, I tried to import the volume. It told me the volume had insufficient devices; a log device shouldn't be necessary for operation, as it hadn't been needed before. I attached the log device and ran cfgadm to configure it, which works and gets around the boot-time nv_sata/i-RAM issue. Now it told me I had sufficient devices, but what happened next was worse. The output showed that my volume consisted of one RAIDZ, an empty log device definition, and, in addition, my i-RAM shimmed in as another degraded drive striped into the array! No ZFS command was run here; it was simply the state of the system after what the previous resilver had accomplished.

Any attempt to import the volume fails with a ZFS error about being unable to "iterate all the filesystems", or something to that effect. I was able to mount various ZFS filesystems read-only using "zfs mount -o ro data/proj" and the like. I then brought up the network and manually transferred the files off to recover them, but this pool is now dead to the world.
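
For anyone in the same boat, the recovery amounted to something like the following sketch; the attachment point, pool, filesystem, and host names here are examples, not exactly what was on my system:

# bring the i-RAM online after boot, working around the nv_sata boot issue
cfgadm -c configure sata1/1
# attempt the import; it complained about iterating filesystems but left the pool visible
zpool import -f data
# mount individual filesystems read-only, one at a time
zfs mount -o ro data/proj
# copy everything off over the network to another host
rsync -a /data/proj/ rescuehost:/rescue/proj/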

What lessons have I learned? Slog devices in ZFS, though a great feature, should not be used in production until they can be evacuated from a pool. There may be errors in the actions I took above, but the bugs I see include the inability of the nv_sata driver to deal with the i-RAM device for some odd reason, at least in B82 and B85 (as far as I've tested). The other bug is that a log replace appears either to not resilver at all (B85) or, when resilvering in older releases, to not correctly resilver the log but instead shim the slog in as a data stripe. I simply can't see how that could, by any stretch of the imagination, be by design.

Thursday, January 17, 2008

Using the iRam: Improving ZFS perceived transaction latency

I've been long overdue in reviewing the Gigabyte i-RAM card and its effect on the performance of your favorite ZFS NAS product. NexentaStor already supports log devices, so the time seemed right to get one for a client I consult with, to help deal with the noticeable pauses one can see when heavy reads and writes compete on a ZFS pool. I hope the single-threaded nature of those commits is resolved at some point, but the i-RAM card appears to be a simple way to inject an NVRAM-like device into your commodity NAS solution.

The card itself is simply four DIMM sockets for DDR RAM, with a battery backup, a reset switch, power drawn from the PCI bus, and a single SATA-I connection to plug the unit into your existing SATA interfaces. Already you can see that the performance ceiling is 150MB/sec, per the SATA-I spec. What does this card do, though? Near-instant reads and writes in a safe, battery-backed ramdisk that your system sees as a 2GB or 4GB drive, which is just what you'd want for a dedicated write-commit device. With many spindles in an array you can likely beat this device on raw throughput, but for many small commits the near-perfect latency of RAM is far better at keeping writes flowing without stalling the drives for reads. Since it's a "slog" device in ZFS terms, it regularly commits to the real underlying storage at full disk bandwidth. Therefore, even when writes must compete with reads on the physical disks, you limit your exposure to perceived stalls in I/O requests under heavier load.
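
Putting the card to work is just a matter of handing it to ZFS as a log vdev once it shows up as a disk. A minimal sketch, assuming the i-RAM appears as c4t0d0 (a made-up device name) alongside three data disks:

# create a pool with the i-RAM as its dedicated intent log
zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 log c4t0d0

# confirm the log vdev is present and healthy
zpool status tank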

For my non-production test, I actually put together the worst-case scenario: an iSCSI-backed ZFS array with NFS clients and many small files. In this case, NFS writes require three fsyncs each on the back-end storage, as required by the protocol (create, modify, close). This is similar to how CAD libraries behave, which the test was designed to reflect. Using iSCSI devices, you can inflict much higher latencies. My iSCSI targets are themselves older SATA-I drives behind an SBEi Linux-based target using 3ware 8500 controllers. Again, nowhere near ideal.

Creating a directory of 5,000 small 8k files, I copied it from a gig-e-connected Linux client to a ZFS pool made of two non-striped iSCSI LUNs, and got a meager 200K/sec of write performance over NFS. Striping the data across the pool instead increased that to 600K/sec at some points. Adding a 2GB Gigabyte i-RAM drive as a log device brought the numbers up to 9MB/sec at peak, averaging around 5MB/sec overall. That's at least 10 times the performance. Again, this test involves many I/O operations rather than much bandwidth.

How fast can data be written to and read from the log device itself? My tests showed 100MB/sec to be common for both reads and writes, with writes only bursting to those numbers for larger streaming data sets. Each of the iSCSI nodes in question could be pulled from at a top rate of 45MB/sec, averaging closer to 27MB/sec, so nominally we are 3x better than at least these gig-e iSCSI devices.
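
If you want a rough ceiling on the card before trusting it with a pool, a timed dd against the raw device is enough. Note that this overwrites the device, so it is strictly a pre-deployment check, and the device name here is only an example:

# sequential write, then read, of 1GB against the raw (whole-disk) device
time dd if=/dev/zero of=/dev/rdsk/c4t0d0p0 bs=1024k count=1024
time dd if=/dev/rdsk/c4t0d0p0 of=/dev/null bs=1024k count=1024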

The final production installation of the i-RAM was with a SATA-II DAS array, and even under heavier load we saw the wait cycles for write commits to the drives kept in check, with a steady 100+MB/sec of use on the commit log (reads and writes). The only caveat to using such a device is that the current builds of OpenSolaris, and thus NexentaStor, do not allow you to remove it once it is added to a pool; a future release is supposed to address that.
