Thursday, May 06, 2010

NexentaStor issues?

Someone pointed me to this "review" and asked whether it was true. I ran into a similar issue myself. The user in that article was using the free 12TB edition without support, which is perhaps why they didn't ask around or file a bug.

So why does copying from one ZFS volume to another over rsync seem to go on forever? I can't be sure this is the same issue, but I saw the same result when going from a NetApp to a ZFS data store using NexentaStor 3.0. The problem was that the source .snapshot tree was exposed; in the case of the above reviewer, it was likely their .zfs tree. I've already mentioned to the Nexenta people that it's safer for rsync service definitions to exclude ".snapshot" and ".zfs" by default and let the end user override that. I too first thought dedup was going awry, but experimentation showed that the problem was rsync discovering those hidden paths and syncing each one. Dedup will only find duplicate blocks that line up, so exposing all those snapshots still comes at some price.

If you are pulling data from one snapshot-based file system to another, it is always best to do so relative to the most recent snapshot: you are assured the data isn't changing during the synchronization, and you'll avoid falling down the snapshot well.
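As a rough sketch of both suggestions, the following builds an rsync command line that excludes the snapshot directories and reads from a single ZFS snapshot. The mountpoint, snapshot name, and target host are hypothetical, and the helper names are mine rather than anything NexentaStor provides; the only rsync features assumed are the standard `-a` and `--exclude` options.

```python
import shlex

# Directory names that expose snapshot trees on NetApp (.snapshot) and ZFS (.zfs).
SNAPSHOT_DIRS = (".snapshot", ".zfs")


def build_rsync_command(source, destination, excludes=SNAPSHOT_DIRS):
    """Return an rsync argv list that skips snapshot directories."""
    command = ["rsync", "-a"]
    for name in excludes:
        command.append(f"--exclude={name}")
    # A trailing slash on the source copies its contents, not the directory itself.
    command.append(source.rstrip("/") + "/")
    command.append(destination)
    return command


def zfs_snapshot_path(mountpoint, snapshot_name):
    """Path at which ZFS exposes a named snapshot under a dataset's mountpoint."""
    return f"{mountpoint.rstrip('/')}/.zfs/snapshot/{snapshot_name}"


if __name__ == "__main__":
    # Hypothetical dataset mountpoint, snapshot name, and target host.
    source = zfs_snapshot_path("/volumes/tank/data", "nightly")
    print(shlex.join(build_rsync_command(source, "backup:/volumes/pool/data/")))
```

Reading from one snapshot means rsync walks a frozen tree exactly once; the excludes only matter for hidden snapshot directories that show up inside the tree being copied.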

Wednesday, March 17, 2010

WD Caviar Green drives and ZFS (UPDATED)

We are in the process of outfitting a new primary storage system, and I was of a mind to buy more WD Caviar Green drives, specifically more of the 1.5TB WDEADS drives, as we already had 4 new ones that had been tested behind a slower RAID card. Before buying more, I searched the usual suspects for pricing and found that the 1TB to 2TB versions of this drive are all priced very well, even for 5400RPM drives, but various sites and comments now note that they should not be used in RAID configurations. Hmm.

I did a little more research and saw this blog post describing why one should avoid directly integrating these drives with ZFS. I had a couple, so I decided to put them in my server with an LSI-3442E SAS backplane and test them. First, I tested my 500GB drives in a mirror set: running "ptime dd if=/dev/zero of=test1G bs=4k count=250000" on the ZFS volume made up of those drives, I transferred 1GB in 3.63 seconds, or 282MB/sec. I then immediately tried the same on my mirror set of the WD drives, benefiting from any caching of the first write. After 50+ minutes of waiting, I killed the write and saw that I had transferred only 426MB, at a rate of 136KB/sec.
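The arithmetic behind those two rates can be checked with a few lines. This is only a sketch that assumes decimal units (1MB = 1,000,000 bytes) and that dd's "bs=4k" means 4096 bytes; the function name is made up for illustration.

```python
def throughput_mb_per_sec(block_size_bytes, block_count, seconds):
    """Average throughput in decimal MB/sec for a dd-style sequential write."""
    return block_size_bytes * block_count / seconds / 1_000_000


# 500GB mirror: bs=4k count=250000 finished in 3.63 seconds.
fast_mb_per_sec = throughput_mb_per_sec(4096, 250_000, 3.63)

# WD mirror: 426MB written at 136KB/sec implies this many minutes of waiting.
slow_minutes = 426_000_000 / 136_000 / 60

# How many times faster the 500GB mirror was than the WD mirror.
slowdown = fast_mb_per_sec * 1_000_000 / 136_000

if __name__ == "__main__":
    print(f"500GB mirror: {fast_mb_per_sec:.0f} MB/sec")
    print(f"WD mirror ran for about {slow_minutes:.0f} minutes")
    print(f"Ratio between the two: roughly {slowdown:.0f}x")
```

The numbers hang together: 4096 x 250000 bytes in 3.63 seconds is about 282MB/sec, and 426MB at 136KB/sec works out to a little over 52 minutes, matching the "50+ minutes" of waiting.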

Yes, I can confirm that these drives are less than useless in a ZFS system (see update below), even as a simple two-disk mirror set. Some basic iostat output showed way too much "asvc_t" service time on the disks, running from 3.5 secs to 10 secs per write, whereas the service times for the working 500GB drives were 0.7msec or the like. I had various mpt_handle_event_sync errors in my kernel logs, so perhaps there is some specific pathology between the SAS HBA, the SAS/SATA backplane, and these disks. However, we've proven this box works well with various drives. I'm going to try yet another 1.5TB drive, likely the previously maligned Seagate drives, since I've yet to have trouble with the latest firmware on those. My 4 WD drives will be placed in enclosures for external Time Machine backups in the near future. WD Caviar Green != Enterprise RAID drives.


I'm leaving the above as is, but I think I have discovered a possibly bad drive in the set: when I employ 4 drives of this type, I see odd I/O patterns but OK performance in a straight RAID 0. However, I regularly have at least one drive with higher average service times and trailing I/O writes as it catches up to the other drives. With these 4 drives in a pool (RAID 0), I got 193MB/sec writes and 242MB/sec reads. Sticking them into a RAID10 (2 data, 2 mirror), I got 78MB/sec writes and 278MB/sec reads.

Splitting them off into two separate RAID1 data pools, I ran my tests and still saw high service times on the drives (only 65 or so, much better than the above, but still slow). Per-mirror-set performance was dismal: I regularly get 150MB/sec+ from a mirror of Caviar Black drives, but these drives hit just 31-34MB/sec (i.e., half of the above RAID10). I guess with enough drives I'll get to better numbers in RAID10. In a RAIDZ1 (RAID5) grouping, it was 60MB/sec on writes and 172MB/sec on reads.

So what accounts for the dismal performance I originally saw? I think it has to do with multiple pools being active when they are not all of this drive type. My original test had a Hitachi drive set as well as a WD Caviar Green drive set. Although my tests ran one at a time, I'm guessing there were some bad timing/driver issues and/or hardware issues when dealing with the mixed HD media.

A second, updated conclusion is that you can use these drives in an array, as long as the array contains only this drive type. RAID10 will get you sufficient performance, but otherwise you'll want to leave these to secondary storage. Future drive replacement scenarios are a real cause for concern.

Tuesday, March 02, 2010

ZFS Log Devices: A Review of the DDRdrive X1

My previous notes here have covered the trend toward commodity storage, my happiness with most things ZFS and Nexenta, and how someday this will all make for a great primary storage story. At Stanford, we have a lot of disk-to-disk backup storage based on Nexenta solutions, using iSCSI or direct attached storage. We have also had some primary tier uses, but have had to play fast and loose with ZFS to get comparable performance. In essence, we sacrificed some of the ensured data integrity of ZFS to meet end users' expectations of what file servers provide.

Typically, this was done by setting these values in /etc/system:
set zfs:zil_disable = 1
set zfs:zfs_nocacheflush = 1

These flags allowed a ZFS appliance to perform similarly to Linux or other systems when it came to NFS server performance. When you are writing a lot of large files, the ZFS Intent Log's additional latency doesn't affect NFS client performance. However, when these same clients expect their fsyncs to be honored on the back end, with mixed file sizes that trend toward a large volume of small writes, we start to see pathologically poor performance with the ZIL enabled. In some of my basic synthetic tests, I measured that performance at 400KB/sec. With the ZIL disabled, I generally got 3-5MB/sec or so, or 10x the performance. That's cheating, and not so safe if the client thinks a write is complete but the backend server doesn't commit it before a power loss or crash.

One ray of hope previously mentioned on this site was the Gigabyte i-RAM. This battery-backed SATA-I solution held some promise, but at the time I used it I found a few difficulties. The state of the art at that time did not allow removal of log (ZIL-dedicated) devices from pools, so one had to recreate a pool if the log device failed. That raised some problems with the i-RAM. First, I had it go offline twice, which required resetting the device, essentially blanking it out, and re-initializing it as a drive with ZFS. Second, the connection was SATA-I only, and it did not play well with certain SATA-II chipsets or when mixed with SATA-II devices. Many users had to enable it in IDE mode versus the preferred AHCI mode.

Time has passed, and new solutions present themselves. First, log devices can now be added to or removed from a pool at any time, on the fly. Also new to the discussion is the DDRdrive X1 product. This mixed RAM and NAND device provides a 4GB drive image with extremely high IOPS and a way to save to stable store (NAND SLC flash) if power is lost on the PCI bus. The device itself is connected to a PCI-Express bus, with drivers for OpenSolaris/Nexenta (among others) that make it visible as a SCSI device.
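For reference, adding and removing a log device comes down to two zpool invocations. The sketch below only builds and prints the command lines; the pool name "tank" and device name "c2t0d0" are hypothetical placeholders, and removing a log device needs a pool version recent enough to support it.

```python
import shlex


def add_log_device(pool, device):
    """argv for attaching a dedicated log (ZIL) device to an existing pool."""
    return ["zpool", "add", pool, "log", device]


def remove_log_device(pool, device):
    """argv for detaching that log device from the pool again."""
    return ["zpool", "remove", pool, device]


if __name__ == "__main__":
    # "tank" and "c2t0d0" are hypothetical pool and device names.
    for command in (add_log_device("tank", "c2t0d0"),
                    remove_log_device("tank", "c2t0d0")):
        print(shlex.join(command))
```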

I tried different scenarios with this ZIL device, and all of them show it to be a sweet little device. I had mixed files that I pushed onto the appliance via NFS (Linux client) and found that I could multiply the number of clients and linearly increase performance. Without the ZIL device, I would hit 450KB/sec and not improve that rate by much with additional writers. Using the ZIL log device immediately resulted in a good 7MB/sec, with 4 concurrent write jobs yielding 27MB/sec. During this test, my X1 showed only a 20% busy rate in iostat. It would appear that I should get up to 135MB/sec at this rate (5x the concurrent writers), but my network connection was just gig-e, so getting anywhere near 120+MB/sec would be phenomenal. Another sample of mixed files with 5 concurrent writers pushed the non-X1 config to 1.5MB/sec, but in this case the X1 took my performance numbers to 45-50MB/sec.
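The 135MB/sec figure is a simple linear extrapolation from the busy rate, capped in practice by the network. A small sketch of that reasoning follows; it assumes throughput scales linearly with device utilization (which a real system may well not do) and treats gigabit Ethernet as 125MB/sec before any protocol overhead.

```python
# Raw gigabit Ethernet line rate in decimal MB/sec, ignoring protocol overhead.
GIGE_MB_PER_SEC = 1_000_000_000 / 8 / 1_000_000


def projected_throughput(observed_mb_per_sec, busy_fraction):
    """Linear extrapolation of throughput to a fully busy log device."""
    return observed_mb_per_sec / busy_fraction


# 27MB/sec observed while iostat showed the X1 about 20% busy.
device_ceiling = projected_throughput(27, 0.20)

# Whichever limit is lower decides what a client can actually see.
achievable = min(device_ceiling, GIGE_MB_PER_SEC)

if __name__ == "__main__":
    print(f"Projected log device ceiling: {device_ceiling:.0f} MB/sec")
    print(f"Gig-e line rate:              {GIGE_MB_PER_SEC:.0f} MB/sec")
    print(f"Expected bottleneck:          {achievable:.0f} MB/sec (the network)")
```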

So what is providing all this performance? As I mentioned above, the fsyncs on writes from the NFS client enforce synchronous transactions in ZFS when the ZIL is not disabled. Without an X1 log device, I measured around 120 IOPS (I/O operations per second). With the dedicated RAM/NAND DDRdrive X1 solution, I easily approach 5000 IOPS. Those commits happen quickly, with the final stable store to your disk array laid out in your more typical 128K blocks per I/O. This dedicated ZIL device has been shown to do up to 200000 IOPS in synthetic benchmarks. Let's try the NFS case one more time, in a somewhat more practical test.

Commonly, in simulation, CAD applications, software development, or the like, you will be conversing with the file server by committing hundreds to thousands of small file writes. To test this out and make it the worst-case scenario of disk-block-sized files, I created a directory of 1000 512-byte files on the client's local disk. I did multiple runs to make sure this fit in memory, so that we were measuring file server write performance. I then ran 400 concurrent jobs writing this to the file server into separate target directories. First, with the dedicated ZIL device enabled, I got 24MB/sec write rates averaging 6000 IOPS. I did spike up to 43K IOPS and 35MB/sec, likely when committing some of the metadata associated with all these files and directories. Still, the X1 was only averaging 20% busy during this test.
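A test along these lines could be sketched as follows. This is not the tool used for the numbers above, which the post does not name; it is a minimal stand-in that uses threads and `shutil.copytree`, with made-up function names, and the demo at the bottom uses small counts and a temporary directory instead of a real NFS mount.

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def make_small_files(directory, count=1000, size=512):
    """Create `count` files of `size` random bytes each in `directory`."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for index in range(count):
        (directory / f"file{index:04d}").write_bytes(os.urandom(size))
    return directory


def copy_concurrently(source, target_root, jobs=400):
    """Copy `source` into `jobs` separate target directories in parallel."""
    target_root = Path(target_root)
    target_root.mkdir(parents=True, exist_ok=True)
    targets = [target_root / f"job{job:03d}" for job in range(jobs)]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # list() waits for every copy and re-raises any error from a worker.
        list(pool.map(lambda target: shutil.copytree(source, target), targets))
    return targets


if __name__ == "__main__":
    # Small counts and a scratch directory so the sketch runs anywhere;
    # a real run would point target_root at the NFS-mounted file server.
    with tempfile.TemporaryDirectory() as scratch:
        source = make_small_files(Path(scratch) / "source", count=10)
        targets = copy_concurrently(source, Path(scratch) / "nfs_mount", jobs=4)
        print(len(targets), "copies of", len(list(source.iterdir())), "files")
```

Separate processes (for example, 400 parallel cp or rsync jobs) would be closer to independent clients; threads are used here only to keep the sketch short.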

Next, I disabled the DDRdrive X1 and tried again, hitting the same old wall. This was the pathological case. With 400 concurrent writes I still just got 120 IOPS and 450KB/sec. My only thought at the time was "sad, very sad".
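Putting the two runs side by side, the measured rates imply a similar number of bytes per operation in both cases, so the gap is almost entirely a matter of how many synchronous operations per second the log can absorb. A quick check, assuming decimal units and using only the figures quoted above:

```python
def bytes_per_operation(throughput_bytes_per_sec, iops):
    """Average payload carried by each I/O operation."""
    return throughput_bytes_per_sec / iops


# Without the X1: 450KB/sec at 120 IOPS.
without_x1 = bytes_per_operation(450_000, 120)

# With the X1: 24MB/sec at 6000 IOPS.
with_x1 = bytes_per_operation(24_000_000, 6000)

# Relative improvement in operations and in throughput.
iops_gain = 6000 / 120
throughput_gain = 24_000_000 / 450_000

if __name__ == "__main__":
    print(f"Bytes per op without X1: {without_x1:.0f}")
    print(f"Bytes per op with X1:    {with_x1:.0f}")
    print(f"IOPS gain: {iops_gain:.0f}x, throughput gain: {throughput_gain:.1f}x")
```

Both cases land near 4KB per operation, and the roughly 50x jump in IOPS lines up with the roughly 53x jump in throughput.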

You can draw your own conclusions from this mostly not-too-scientific test. For me, I now know of an affordable device that has none of the drawbacks (4K block size, wear leveling) of SSD drives for use as a ZIL device. One can now put together a commodity storage solution with this and Nexenta, and have the same expected performance without compromise as one would expect from any first tier storage platform.

That leads me to the "one more thing" category. I decided to place some ESX NFS storage-pooled volumes on this box and compare it to the performance of the NetApps we use to manage our ESX VMs (NFS). The file access modes of the VMs tend to be similar to mixed-size file operations, but they do tend to be larger writes, so the ZIL may not have as drastic an effect. Anyway, I tried it without the X1 and got 30-40MB/sec measured disk performance from operations within the VM (random tests, dd, etc). Enabling the ZIL device, I got 90-120MB/sec rates, so we still got a 3x improvement. I couldn't easily isolate all traffic away from my NetApps, but I averaged 65MB/sec on those tests.

Here, I think the conclusion I can draw is this: the dedicated ZIL device again improved performance up to matching what I can theoretically get from my network path. The comparison one can safely make with a NetApp is not that it's faster, as my tests ran under different loads, but that it can likely match the line rates of your hardware and remove from the equation any concern about filesystem and disk array performance. Perhaps in a 10G network environment or with some link aggregation we can start to stress the DDRdrive X1, but for now it's obvious that it enables commodity storage solutions to meet typical NAS performance expectations.