Thursday, January 17, 2008

Using the iRam: Improving ZFS perceived transaction latency

I've long been overdue in reviewing the Gigabyte iRam card and its effect on the performance of your favorite ZFS NAS product. NexentaStor already supports log devices, so the time seemed right to get one for a client I consult with, to help deal with the noticeable pauses one sees when heavy reads and writes compete on a ZFS pool. I hope the single-threaded nature of those commits is resolved at some future point, but the iRam card appears to be a simple way to inject an NVRAM-like device into your commodity NAS solution.

The card itself is simply four DIMM sockets for DDR RAM, with a battery backup, a reset switch, power drawn from the PCI bus, and a single SATA-I connection to plug the unit into your existing SATA interfaces. Already you can see that the performance limit is 150MB/sec, per the SATA-I spec. What does this card do, though? Near-instant reads and writes to a safe, battery-backed ramdisk that your system sees as a 2GB or 4GB drive: just what you'd want for a dedicated write-commit device. With many spindles in an array, you can likely beat this device on raw throughput, but with many small commits, the near-perfect latency of RAM is far better at keeping writes flowing without stalling the drives for reads. Since it's a "slog" device in ZFS terms, it regularly commits to the real underlying storage at full disk bandwidth. Therefore, even when writes must compete with reads on the physical disk, you limit your exposure to perceived stalls in I/O requests, even under higher load.
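For reference, attaching such a device as a dedicated log takes a single command in OpenSolaris/NexentaStor-style systems. This is a minimal sketch; the pool name "tank" and device name "c3t0d0" are illustrative stand-ins for your own pool and the iRam's SATA device:

```shell
# Sketch: dedicate the iRam's SATA device as a separate intent log
# ("slog") for an existing pool. Pool "tank" and device "c3t0d0" are
# assumed names for illustration only.
zpool add tank log c3t0d0

# The device then appears under a separate "logs" section of the pool:
zpool status tank
```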

For my non-production test, I actually put together the worst-case scenario: an iSCSI-backed ZFS array with NFS clients and many small files. In this case, any NFS write requires three fsyncs on the back-end storage, as required by NFS (create, modify, close). This is actually similar to CAD libraries, which the test was designed to reflect. Using iSCSI devices, you can inflict much higher latencies. My iSCSI targets are actually older SATA-I drives themselves, on an SBEi Linux-based target using 3ware 8500 controllers. Again, nowhere near ideal.

Creating a directory of 5000 small 8K files, I copied it over NFS from a gigabit-connected Linux client to a ZFS pool (made of two non-striped iSCSI LUNs) and got a meager 200K/sec write performance. Striping the data across the pool instead raised that to 600K/sec at some points. Adding a 2GB Gigabyte iRam drive as a log device pushed those numbers as high as 9MB/sec, averaging around 5MB/sec overall. That's at least ten times the performance. Again, this test is bound by I/O operations rather than bandwidth.
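The workload above is easy to reproduce. Here's a sketch of how I'd generate it; the destination mount point is an assumption (it defaults to a /tmp path here so the commands are safe to run anywhere, but on a real test it would be the NFS mount of the pool):

```shell
# Sketch: generate 5000 8K files to approximate the small-file (CAD
# library style) workload, then time the copy onto the pool. $DEST is
# an assumed NFS mount point, defaulting to /tmp for safety.
DEST=${DEST:-/tmp/nfs-pool-mount}
mkdir -p smallfiles "$DEST"
i=1
while [ $i -le 5000 ]; do
  dd if=/dev/urandom of=smallfiles/file$i bs=8k count=1 2>/dev/null
  i=$((i + 1))
done
time cp -r smallfiles "$DEST"/
```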

How fast can data be written to and read from that log device? My tests showed that 100MB/sec for reads and writes was common, with writes only bursting to those numbers for larger streaming data sets. Each of the iSCSI nodes in question could be pulled from at a top rate of 45MB/sec, but averaged closer to 27MB/sec. Nominally, then, we are 3x better than at least these gig-e iSCSI devices.
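A quick streaming check of this kind can be sketched with dd. On the real system you'd point $TARGET at the iRam's raw device (some /dev/rdsk/cXtYd0 path, which is an assumption about your device naming); a scratch file is used here so the sketch is safe to run as-is:

```shell
# Sketch: rough sequential write/read throughput check. Substitute the
# log device's raw path for $TARGET on real hardware (which would
# overwrite its contents); a /tmp scratch file keeps this harmless.
TARGET=${TARGET:-/tmp/slog-test.img}
dd if=/dev/zero of="$TARGET" bs=1024k count=256   # sequential write, ~256MB
dd if="$TARGET" of=/dev/null bs=1024k count=256   # sequential read back
```

dd reports elapsed time per pass, which is enough for the ballpark MB/sec figures quoted above.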

The final production installation of the iRam device was with a SATA-II DAS array, and even in heavier load scenarios, we saw the wait cycle for write commits to the drives kept in check, with a steady 100+MB/sec use of the commit log (reads and writes). The only caveat of using such a device is that current builds of OpenSolaris, and thus NexentaStor, do not allow you to remove it once it has been added to a pool. A future release is supposed to address that.

Friday, January 11, 2008

Swept Under the Rug

In our day-to-day management of technology, we tend to pick paths that resolve the most pressing pain points. Inadvertently, we often also sweep certain problems under the rug, awaiting the day when it all must be cleaned up. Many choices do exactly this, solving the present problem while creating perhaps larger problems down the road. In my evolving strategy on storage, the move away from tape to disk-based online storage solves the most obvious problems but creates a whole series of others, including file-based disaster recovery, long-term maintenance of the underlying disk technology, true long-term persistence of data, and general accessibility of the data by future technology. Today, I'll focus only on our next major pain point: disaster recovery.

Recently, a few incidents underscored the need for better-thought-out solutions than what we already had in place. We thought we might be ahead of the curve with tiered copies of data on secondary NAS solutions, with our backup windows well within reason. It's obvious we made the right choice in doing incremental file-based backups to secondary NAS, as the end data containers are universal across network file protocols. Recovery of any given file, or even of a full data store, still beats tape libraries many times over. The architecture in place has allowed us to scale from the gigabyte world to the terabyte world; our backup windows are well in hand, and spot recovery is a cinch. But some disaster recovery scenarios remain problematic.

The first scenario hit just a week ago. A mere 50GB file store of Maildir-formatted mail, where each message is a file and mail folders are directories, developed write errors on its underlying Linux XFS volume. This is far from our largest such install; various mail servers for organizations we deal with exceed 500GB. We suspect the RAID card's NVRAM was toast, disallowing further writes, and we had to migrate the mail to another server quickly. Simple enough: let's recover from our second-tier mail store, right? The attempt was made, but we found ourselves limited not so much by reading millions of small 1K files as by recommitting those files onto a journaled filesystem. The metadata updates alone were bad enough. In the end, we were limited by file operations per second, not pure bandwidth to the disk. Our estimated recovery time was a minimum of 14 hours, and that for only 50GB. A clue to the long-term solution lay in how we actually restored everything in under 2 hours: we relied on an xfsdump from the read-only failing array to a new filesystem on spare hardware.
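That dump-based path boils down to a single pipeline with the standard XFS tools, streaming the whole filesystem in one pass instead of recreating millions of files individually. A minimal sketch, with illustrative mount points:

```shell
# Sketch: level-0 dump of the failing (read-only) XFS volume streamed
# straight into a restore on the fresh filesystem. Mount points are
# illustrative; run as root on a real system.
xfsdump -l 0 - /mnt/failing-array | xfsrestore - /mnt/new-filesystem
```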

The obvious up-front answer to disaster recovery in a multi-terabyte world is to keep copies of everything in as close to a high-availability setup as you can afford. If the underlying RAID array were actually two arrays with software mirroring across them, or two separate machines that either attached to shared mirrored arrays or mirrored their underlying RAID arrays over the network, we'd only have to worry about natural disasters. Preventing the true disaster recovery scenario up front is the only way to win outright, but most of us simply don't have the luxury, the resources, or the ability to safely migrate the myriad production (or otherwise in-use) solutions over to the ideal configuration. We can all try to reach this nirvana, but it's simply not as attainable for most of us as we'd like.

We can, however, address some of the pain of the disaster recovery scenario for disk-based solutions. The iSCSI and SAN vendors have been on this for some time and have extolled the virtues of block-based storage. With it, you can stream I/O at near the theoretical limits of the hardware. However, running all your systems against a SAN throws you down the path of the usual hardware-based solution to a general problem, with the usual vendor lock-in quibbles. We have already bought into the software-based approach NexentaStor offers, and happily, it already provides a solution that fits our needs. With thin provisioning of virtualized storage volumes (zvols), one can expose block-level storage to clients yet still treat the volumes as snapshot-capable files, retaining file-level services on the back-end, second-tier NAS. Clients generally access these over iSCSI, and they can either depend directly on these network-based volumes as if they were local filesystems, or simply use their filesystems' native dump programs to periodically maintain a near-synchronized copy of a true DAS filesystem on a second-tier, block-level copy. The latter is nice because it doesn't strain the back-end storage architecture to service all clients in parallel at full production performance; we use network and storage resources only for backup.
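On the second-tier NAS side, such a zvol can be sketched in a few commands. This is a sketch under assumptions: the pool/dataset names and 500g size are made up, and shareiscsi is the OpenSolaris-era property for iSCSI export:

```shell
# Sketch: a sparse (thin-provisioned) zvol on the second-tier NAS,
# exported over iSCSI and snapshotted like any other dataset.
# "tank/backup/mailstore" and the 500g size are illustrative.
zfs create -s -V 500g tank/backup/mailstore
zfs set shareiscsi=on tank/backup/mailstore
zfs snapshot tank/backup/mailstore@nightly
```

Because the zvol is just another dataset, the snapshot captures the entire block-level copy in one operation.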

What does this solve? When the local arrays fail, the backup process can be run in reverse at streaming I/O rates, perhaps as high as 100+MB/sec over gigabit ethernet. In the case of my mail spool filesystem, we recovered at 25-30MB/sec instead of the 500-800K/sec we had seen. Even if it's not the most up-to-date copy, if one also does file-level backup of the underlying filesystem to NexentaStor or the like at a faster interval, you can recover from that incrementally after the first block-level recovery. Either way, you taste the sweetness of success. Again, the dirt is stuffed under the rug if this is hundreds of terabytes, and some day soon that may be just as commonplace, but perhaps we are again ahead of the curve. The one side effect is that you'll want more cheap storage readily available on the second tier.

I'll quickly describe the second scenario, where a failing system also needed its 1TB of mostly larger files migrated. Our top rates of file-level recovery were at best 1GB/minute, and generally less. Again, redundancy up front would have made sense, but the same solution above could more than double the rate of recovery by restoring the primary filesystem at block speeds. This is similar to how virtual machines are managed from SAN, iSCSI, or even NFS: the VMs themselves are represented as files, so operations on those files approach the maximum speed of block storage operations, while keeping them on a NAS allows easy sharing and management, including snapshots. No hardware tricks, all software. We haven't addressed the next stumbling blocks, which include kernel page size limitations on true file I/O, but the dirt is nicely hidden for the time being.