Monday, May 30, 2011

Web Frameworks / Models just aren't the same

Most of my days are spent hacking on web applications, with a strong requirement for database-backed solutions. I've been drinking the WebObjects Kool-Aid for quite some time, as there hasn't been a robust ORM (Object-Relational Mapping) solution that matches the maturity and it-just-works quality of WebObjects' EOF layer.

However, the proverbial writing has been on the wall when it comes to Apple's continued caretaking of the public version of this technology. Enough alternative technologies have arisen to make the ongoing effort WebObjects requires questionable, and my mind simply can't get comfortable with the rule-engine approach of Modern Direct To Web (D2W) that is meant to keep WebObjects current. It's always a question of finding the best tool for the job, and part of the toolset is oneself: am I sharp enough to meet the new challenges I face? I've been re-investing in WebObjects daily while also checking out other frameworks, and in almost every case I find that they still don't match the now-antiquated WO at getting things done right.

Then there is Grails. First, it's not Rails, which leaves a sour taste in my mouth. But it takes enough from both the Java/WO and Rails worlds, some of the best and some of the worst (Servlets, bleh!). I'm also stuck dealing with both Hibernate's deficiencies and the band-aids Grails layers on top of it. Burt Beckwith has written multiple articles on the brain-dead handling of collections, and especially of many-to-many relationships, which require fetching all entities just to guarantee uniqueness in add and delete operations (it's really an issue with belongsTo and hasMany; see his original example and the indirect implementation details). Obviously, the object-graph handling shows some immaturity. Grails 1.4 and its underlying updates finally get me past my fears and concerns, and so a few projects are now being built on Grails, since I just can't get the quick build-out of applications above the model layer that I need from Modern D2W, and I require the dynamism of Groovy for certain specific requirements. Again, it's more about finding the tool that suits me best than about the limitations of the tools.

This brings us to the meat of today's post. EOF and Wonder's templates have spoiled me in terms of what model code (including the generation gap pattern) is generated for me and what I therefore expect at the model layer. I've been trying to come to terms with the features, and the lack thereof, of model classes in Grails apps. Rereading the great book Grails in Action, I came to an important realization about what is missing here. Section 5.2 gets into the best practice of using Grails services to encapsulate business logic and follow DRY principles. But if one considers MVC frameworks and where model logic goes, there is a lot of multi-domain logic (relationships) that never ends up in Grails domains and is best handled in services. In the end, I've come to believe that the direct analogue of WebObjects EOF models is not Grails domains, but Grails services.

With all the time in the world, I'd want to spend it on a plugin or template enhancements that auto-generate more complete service definitions from "grails create-service": one that takes a domain, wraps its basic operations, and builds out basic relationship-management methods in the service. This would also be an ideal place to be collections-aware and turn some of Beckwith's ideas into standard practice. If collections were always handled the same way in code, implementing the correct, performant approach would become much less painful.

Furthermore, akin to the generation gap pattern, domains would be tinkered with less, doing little beyond defining what can and should go into the database directly, which matters for managing database migrations. Any and all custom logic would instead live in the service. Perhaps one day domains will get all the correct relationship-handling logic that EOF superclasses generally provide, and the service will become the custom-logic-only layer I've come to expect from EOF subclasses for my model objects. For now, I feel my mind can work with this construct and be productive quickly in Grails, instead of fighting against the grain or dirtying my controllers with model-specific mess.

For now, though, I will endeavor to use services extensively, and make sure any generated scaffolding takes them into account more than domains.

Thursday, May 06, 2010

NexentaStor issues?

Someone pointed out this "review" to me and asked if it was true. I ran into a similar issue myself. The user in that article was using the free 12TB edition without support, so perhaps that is why they didn't ask around or file a bug.

So, why does copying from one ZFS volume to another over rsync seemingly go on forever? I can't be sure this was the reviewer's issue, but I had the same result, in my case going from a NetApp to a ZFS data store using NexentaStor 3.0. The problem was that the source .snapshot tree was exposed, and in the case of the above reviewer, their .zfs tree likely was. I've already suggested to the Nexenta people that it's safer to exclude ".snapshot" and ".zfs" by default in rsync service definitions and let the end user override it. I too first thought it was dedup going awry, but experimentation showed the problem to be rsync discovering those hidden paths and syncing each one. Dedup will only find duplicate blocks that line up, and exposing all those snapshots comes at a price.

If you are pulling data from one snapshot-based file system to another, it is always best to do so relative to the most recent snapshot, as you are assured the data isn't changing during the synchronization, and you'll avoid falling down the snapshot well.
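
As a rough sketch of both points (hostnames, paths, and the snapshot name here are hypothetical), an rsync pull that works from a fixed snapshot and skips any exposed snapshot trees might look like:

rsync -aH --exclude='.snapshot/' --exclude='.zfs/' \
    netapp:/vol/projects/.snapshot/nightly.0/ /volumes/tier2/projects/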

Wednesday, March 17, 2010

WD Caviar Green drives and ZFS (UPDATED)

We are in the process of outfitting a new primary storage system, and I was of a mind to buy more WD Caviar Green drives, specifically more of the 1.5TB WD EADS drives, as we already had 4 new ones that were tested behind a slower RAID card. Before buying more, I searched the usual suspects for pricing and found the 1TB to 2TB versions of this drive all priced very well, even for 5400RPM drives, but various sites and comments now note that they should not be used in RAID configurations. Hmm.

I did a little more research and saw this blog post describing how one should avoid directly integrating these drives with ZFS. I had a couple on hand, so I decided to put them in my server behind an LSI-3442E SAS backplane and test them. First, I tested my 500GB drives in a mirror set: running "ptime dd if=/dev/zero of=test1G bs=4k count=250000" on the ZFS volume made up of those drives, I transferred 1GB in 3.63 seconds, or 282MB/sec. I then immediately tried the same on my mirror set of the WD drives, which if anything should have benefited from caching of the first write. After 50+ minutes of waiting, I killed the write and saw that I had transferred only 426MB, at a rate of 136KB/sec.

Yes, I can confirm that these drives are less than useless in a ZFS system (see update below), even as a simple two-disk mirror set. Some basic iostat output showed way too much "asvc_t" service time on the disks, running from 3.5 seconds to 10 seconds per write, whereas the service times for the working 500GB drives were 0.7ms or so. I had various mpt_handle_event_sync errors in my kernel logs, so perhaps there is some specific pathology between the SAS HBA, the SAS/SATA backplane, and these disks. However, we've proven this box works well with various drives. I'm going to try yet another 1.5TB drive, likely the previously maligned Seagate drives, since I've yet to have trouble with the latest firmware on those. My 4 WD drives will be placed in enclosures for external Time Machine backups in the near future. WD Caviar Green != Enterprise RAID drives.
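
If you want to reproduce the latency check, the per-disk service times above came straight out of iostat's extended device statistics (standard Solaris syntax; the interval is arbitrary):

iostat -xn 5

Watch the asvc_t column: the healthy 500GB mirror sat well under a millisecond per request, while the Caviar Greens ran into the thousands of milliseconds.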

UPDATE:

I'm leaving the above as is, but I think I may have discovered a bad drive in the set: when I employ 4 drives of this type I see odd I/O patterns but OK performance in a straight RAID 0. However, I regularly have at least one drive with higher average service times and trailing I/O writes as it catches up to the other drives. With these 4 drives in a striped pool (RAID 0), I got 193MB/sec writes and 242MB/sec reads. Sticking them into a RAID10 (two striped mirror pairs), I got 78MB/sec writes and 278MB/sec reads.

Splitting them off into two separate RAID1 (mirrored) data pools, I ran my tests and still saw high service times on the drives (only 65ms or so, much better than the above, but still slow). Per-mirror-set performance was dismal: I regularly get 150MB/sec+ from a mirror of Caviar Blacks, but these drives just hit 31-34MB/sec (i.e., half of the RAID10 figure above). I guess with enough drives I'd get to better numbers in RAID10. In a RAIDZ1 (RAID5-like) grouping, it was 60MB/sec on writes and 172MB/sec on reads.
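
The pool layouts tested above correspond to zpool commands roughly like these (the pool and device names are placeholders; substitute your own):

zpool create greens c3t0d0 c3t1d0 c3t2d0 c3t3d0                   # striped, RAID 0
zpool create greens mirror c3t0d0 c3t1d0 mirror c3t2d0 c3t3d0     # RAID10: two striped mirror pairs
zpool create greens raidz c3t0d0 c3t1d0 c3t2d0 c3t3d0             # RAIDZ1, RAID5-like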

So what accounts for the dismal performance I originally saw? I think it has to do with multiple pools being active that are not all of this drive type. My original test had a Hitachi drive set as well as a WD Caviar Green drive set. Although my tests ran one at a time, I'm guessing there were some bad timing/driver and/or hardware issues when dealing with the mixed drive types.

A second, updated conclusion is that you can use these drives in an array, provided the array contains only this drive type. RAID10 will get you sufficient performance, but otherwise you'll want to leave these drives to secondary storage. Future drive-replacement scenarios remain a real cause for concern.

Tuesday, March 02, 2010

ZFS Log Devices: A Review of the DDRdrive X1

My previous notes here have covered the trend toward commodity storage, my happiness with most things ZFS and Nexenta, and how someday this will all make for a great primary storage story. At Stanford, we have a lot of disk-to-disk backup storage based on Nexenta solutions, using iSCSI or direct-attached storage. We have also had some primary-tier uses, but have had to play fast and loose with ZFS to get comparable performance. In essence, we sacrificed some of ZFS's assured data integrity to meet end users' expectations of what file servers provide.

A typical workaround was to set these values in /etc/system:
set zfs:zil_disable = 1
set zfs:zfs_nocacheflush = 1

These flags allowed a ZFS appliance to perform similarly to Linux or other systems when it came to NFS server performance. When you are writing a lot of large files, the ZFS Intent Log's additional latency doesn't affect NFS client performance. However, when these same clients expect their fsyncs to be honored on the back end, with mixed file sizes that trend toward a large volume of small writes, we start to see pathologically poor performance with the ZIL enabled. I measured as little as 400KB/sec in some of my basic synthetic tests. With the ZIL disabled, I generally got 3-5MB/sec or so, roughly 10x the performance. That's cheating, and not so safe if the client thinks a write is complete but the backend server doesn't commit it before a power loss or crash.

One ray of hope previously mentioned on this site was the Gigabyte i-RAM. This battery-backed SATA-I solution held some promise, but at the time I used it I found a few difficulties. The state of the art back then did not allow removal of log (ZIL-dedicated) devices from pools; one had to recreate a pool if the log device failed. That was a real problem with the i-RAM: I had it go offline twice, requiring a reset of the device, which essentially blanked it out and meant re-initializing it as a drive with ZFS. On top of that, the connection was SATA-I only, and it did not play well with certain SATA-II chipsets or when mixed with SATA-II devices; many users had to run it in IDE mode rather than the preferred AHCI mode.

Time has passed, and new solutions present themselves. First, log devices can now be added to or removed from a pool at any time, on the fly. Also new to the discussion is the DDRdrive X1. This mixed RAM-and-NAND device provides a 4GB drive image with extremely high IOPS and a way to save to stable store (SLC NAND flash) if power is lost on the PCI bus. The device itself connects to a PCI Express bus, with drivers for OpenSolaris/Nexenta (among others) that make it visible as a SCSI device.
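
The on-the-fly log device handling mentioned above is just standard zpool syntax; assuming a pool named data and the X1 showing up as c4t0d0 (both placeholders):

zpool add data log c4t0d0       # dedicate the X1 as a separate ZIL (log) device
zpool remove data c4t0d0        # remove it again later, with the pool still online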

I tried different scenarios with this ZIL device, and all of them showed it to be a sweet little device. I pushed mixed files onto the appliance via NFS (Linux client) and found that I could multiply the number of clients and linearly increase performance. Where I would hit 450KB/sec without the ZIL device, and not improve that rate by much with additional writers, using the dedicated log device immediately resulted in a good 7MB/sec, with 4 concurrent write jobs yielding 27MB/sec. During this test, my X1 showed only a 20% busy rate in iostat. It would appear that I should get up to 135MB/sec at this rate (5x the concurrent writers), but my network connection was just gig-E, so getting anywhere near 120+MB/sec would be phenomenal. Another sample of mixed files with 5 concurrent writers pushed the non-X1 config to 1.5MB/sec, but in this case the X1 took my performance numbers to 45-50MB/sec.

So what is providing all this performance? As I mentioned above, the fsyncs on writes from the NFS client enforce synchronous transactions in ZFS when the ZIL is not disabled. My IOPS (I/O operations per second) without an X1 log device measured around 120. With the dedicated RAM/NAND DDRdrive X1, I easily approached 5000 IOPS. Those commits happen quickly, with the final write to stable store on your disk array laid out in the more typical 128K blocks per I/O. This dedicated ZIL device has been shown to do up to 200,000 IOPS in synthetic benchmarks. Let's try the NFS case one more time, in a somewhat more practical test.

Commonly, in simulation, CAD applications, software development, or the like, you will be conversing with the file server and committing hundreds to thousands of small file writes. To test this out, and to make it the worst-case scenario of disk-block-sized files, I created a directory of 1000 512-byte files on the client's local disk. I did multiple runs to make sure this fit in memory, so that we were measuring file server write performance. I then ran 400 concurrent jobs writing this to the file server into separate target directories. First, with the dedicated ZIL device enabled, I got 24MB/sec write rates averaging 6000 IOPS. I did spike up to 43K IOPS and 35MB/sec, likely when committing some of the metadata associated with all these files and directories. Still, the X1 was only averaging 20% busy during this test.
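
The client-side test itself was nothing fancy; a sketch of the idea (the file count, size, and concurrency follow the description above, while the mount point and directory names are illustrative):

mkdir smallfiles
for i in $(seq 1 1000); do
    dd if=/dev/zero of=smallfiles/f$i bs=512 count=1 2>/dev/null    # 1000 tiny, block-sized files
done

for j in $(seq 1 400); do
    cp -r smallfiles /mnt/nas/run$j &                               # 400 concurrent writers, separate target dirs
done
wait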

Next, I disabled the DDRdrive X1 and tried again, hitting the same old wall. This was the pathological case. With 400 concurrent writes I still just got 120 IOPS and 450KB/sec. My only thought at the time was "sad, very sad".

You can draw your own conclusions from this mostly not-too-scientific test. For me, I now know of an affordable device that has none of the drawbacks (4K block size, wear leveling) of SSDs for use as a ZIL device. One can now put together a commodity storage solution with this and Nexenta, and get the uncompromised performance one would expect from any first-tier storage platform.

That leads me to the "one more thing" category. I decided to place some ESX NFS storage-pooled volumes on this box and compare its performance to the NetApps we use to manage our ESX VMs (over NFS). The file access patterns of the VMs tend to resemble mixed-size file operations, but they do trend toward larger writes, so the ZIL may not have as drastic an effect. Anyway, I tried it without the X1 and got 30-40MB/sec measured disk performance from operations within the VM (random tests, dd, etc.). Enabling the ZIL device, I got 90-120MB/sec, so we still got a 3x improvement. I couldn't easily isolate all traffic away from my NetApps, but I averaged 65MB/sec on those tests.

Here, I think the conclusion I can draw is this: the dedicated ZIL device again improved performance up to what I can theoretically get from my network path. The comparison one can safely make with a NetApp is not that it's faster, as my tests ran under different loads, but that it can likely match the line rates of your hardware and remove from the equation any concern about filesystem and disk array performance. Perhaps in a 10G network environment, or with some link aggregation, we can start to stress the DDRdrive X1, but for now it's obvious that it enables commodity storage solutions to meet typical NAS performance expectations.


Friday, November 27, 2009

ZFS Resilver quirks and how it lies

One of my ZFS-based storage appliances was running low on disk space, and since I had made it a three-way stripe of mirrored disks, I could take the 6 500GB drives and replace them in place with 1.5TB drives, with the result being a major increase in capacity. A nifty ZFS software RAID feature versus typical hardware RAID setups. It's all good in theory, but resilvering (rebuilding a mirror pair) after replacing a drive takes quite some time. Even with only about 400GB to rebuild per drive, the resilvering process covers 90% of the rebuild in 12 hours or so, but that last 10% takes another 10-12 hours. I think this has a lot to do with how snapshots or small files hurt ZFS performance, especially when you are close to a full disk, but it's all just a guess as to why it's slow on the tail end.
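
For those following along, the in-place swap is plain zpool replace, one drive of a mirror at a time (device names here match the status output further down; yours will differ):

zpool replace data c2t0d0       # after physically swapping the 500GB drive for the 1.5TB drive in the same slot
zpool status data               # watch the resilver finish before touching the other drive in that mirror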

The resilver went as planned, replacing one drive after another serially, taking care to do only one drive of a pair at a time. Near the end, I started to get greedy. With one resilver 98% done, I detached a drive in another mirrored pair on the same volume, planning on at least placing the new drive into the chassis so I could start the final drive's resilver remotely. To my surprise, the resilver restarted from scratch, so I had another 24 hours of delay to go. So, any ZFS drive removal will reset in-progress scrubs/resilvers!

I then decided just to go ahead with the second resilver. This is where it got really strange. The two mirrored pairs started to resilver, and the speed was seemingly faster. After 12 hours, both pairs had about 400GB resilvered and the status of the volume indicated it was 100% done and was finishing. Hours later, it was still at 100%, but the resilver counter per drive kept climbing. Finally, after the more typical 24 hours or so, it noted it was completed.

  pool: data
 state: ONLINE
 scrub: resilver completed after 26h39m with 0 errors on Tue Nov 24 22:33:46 2009
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0  783G resilvered
          mirror    ONLINE       0     0     0
            c2t5d0  ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0  781G resilvered

Yes, it looks like, at least with this B104+ kernel in NexentaStor, the resilver counters lie. When you have two ongoing resilvers, each counter is nominally the total data resilvered across the whole pool, so you'll need to wait for double the expected data amount before it completes. It's therefore very important not to reset the system until 100% turns into a "resilver completed..." statement in the status report.

Tuesday, August 18, 2009

Prepping for Snow Leopard Server and a lesson on backups

We all know that MacOSX 10.6 Server is coming out RSN. All of us who use OpenDirectory are starting to wonder about the pain we will soon endure when upgrading. Here are a few hints to keep in mind.

- Time Machine backups do not, by default, restore a good MacOSX Server image. Read all about it here and learn now what will go wrong. Namely, edit the mentioned StdExclusions.plist file to remove /var/log and /var/spool from the exclusion list (see the quick check after this list), and consider recreating your backups from scratch

- If you have ADC membership or otherwise can purchase WWDC 09 videos, acquire Session 622, Moving to Snow Leopard Server. Lots of good stuff there, but I'll suggest a less than perfect but simpler upgrade path

- To upgrade, use Carbon Copy Cloner or the like to make a full bootable system copy on an external drive -- likely your Time Machine disk. At this point, you can also re-enable Time Machine to use the rest of the disk for backups using the corrected exclusions list. Obviously, this disk should be far larger than what you have used on your OSX Server.

- You might be upgrading to a beefier 64-bit Intel configuration for your OpenDirectory master, or just upgrading in place on the old hardware. I recommend doing this on new hardware: take that clone disk, boot off of it on the new box, and then clone yet again to the local disk or array. Now you can do an in-place upgrade to 10.6 on non-production hardware, test, etc. Your previous master then becomes your first replica when you go production. If you upgrade in place, you should first test that the clone disk boots as your primary, but either way you now have a full production-worthy backup disk.

- Once you're past a certain point in time, I'd remove the backupdbs on that external disk (don't erase the disk) and reuse it for Time Machine again. You now have a way to revert to pre-upgrade 10.5 or to any 10.6 point in time. You should check the exclusions file before commencing Time Machine backups to make sure you are getting the expected full server backup.

- Profit
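
The quick check mentioned in the first bullet, assuming the usual 10.5-era location of that plist (verify the path on your own install):

grep -B1 -A1 -E '/var/log|/var/spool' \
    /System/Library/CoreServices/backupd.bundle/Contents/Resources/StdExclusions.plist

If those paths still show up between string tags after your edit, Time Machine will keep skipping them.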

Saturday, August 02, 2008

Amanda: simple ZFS backup or S3

When I first started researching ZFS, I found it somewhat troubling that no native backup solution existed. Of course there were the ZFS send/recv commands, but those didn't necessarily work well with existing backup technologies. At the same time, the venerable open source backup solution, Amanda, had found a way to move beyond the limitation of maximum tape size restricting backup run size. Over time, we have found ways to marry these two solutions.

In my multi-tier use of ZFS for backup, I always need an n-tier component that allows for permanent archiving to tape every 6 months or year, as deemed fit for the data being backed up. These are full backups only, and due to the large amount of data in the second-tier pool, a backup to tape may span dozens of tapes and run for multiple days. I found I had to tweak Amanda's typical configuration to allow for very long estimate times, as the correct approach to backing up a ZFS filesystem today involves tar, and Amanda does a full tar estimate of a backup before the real backup is attempted. Otherwise, a sufficiently large tape library is all you need, along with a working Amanda client configuration on your ZFS-enabled system.
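
As a hedged sketch, the tweaks boil down to a couple of amanda.conf entries like these (the values are illustrative and need tuning to your data sizes; the dumptype may already exist in your sample config):

etimeout 86400                  # seconds the planner waits per disklist entry for the tar-based estimate

define dumptype user-tar-span {
   user-tar                     # inherit the stock GNU tar dumptype
   tape_splitsize 2 Gb          # split the dump into chunks so it can span multiple tapes
}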

For those following along, I'm an avid user of NexentaStor for my second tier storage solution. Setup of an amanda client on that software appliance is actually quite easy.

setup network service amanda-client edit-settings
setup network service amanda-client conf-check
setup network service amanda-client enable

That's all that one needs to do. There is a sample line in the amanda configuration that you adjust in the first command above. The line I used is similar to this:

amandasrv.stanford.edu amanda amdump

You'll find that, depending on your build of the Amanda server, you'll have either the legacy user name of "amanda", the Zmanda default of "amanda_backup", or the Red Hat default of "backup" as the user things run as. I guess there had to be a user naming conflict at some point with "amanda".

The hardest part of the configuration is finding where you have your long-term snapshots. Since a backup run can take days to weeks, you'll likely wish to back up volumes relative to a monthly snapshot. In your Amanda /etc/amanda/CONFDIR/disklist configuration, a sample for a ZFS-based client named nexenta-nas with volumes tier2/dir* is:

nexenta-nas /volumes/tier2/dir1/.zfs/snapshot/snap-monthly-1-latest user-tar-span
nexenta-nas /volumes/tier2/dir2/.zfs/snapshot/snap-monthly-1-latest user-tar-span


Note well the use of user-tar-span in the two lines above. This allows for backing up large volumes over multiple tapes in Amanda. That one limitation of tape spanning in Amanda was solved in a novel way: backup streams are broken up into chunks of a set size, so that a write failure at the end of one tape can begin fresh again at the beginning of that chunk on the following tape. This feature also allows Amanda to be used to back up to Amazon's S3 service. Yes, instead of going to tape, you can configure a tape server to write to S3. S3 limits writes to a maximum of 2GB per file, and Amanda's virtual tape solution, combined with that chunk sizing of backups, works wonderfully to mate ZFS-based storage solutions to S3 for an n-tier solution. Please consult Zmanda's howto for configuring your server correctly. There really is nothing left to configure to get ZFS data to S3.

Sunday, July 27, 2008

Pogo Linux, Nexenta announce StorageDirector Z-Series storage

Pogo Linux Inc., a Seattle-based storage server manufacturer, and Nexenta Systems Inc., developer of NexentaStor, an open storage solution based upon the revolutionary file system ZFS, announced Wednesday immediate availability of a new set of storage appliances featuring NexentaStor.

Yeah, that was the posted text above. What does it really mean? More kit choices to get an open storage NAS. There are some nice configuration options when ordering, but I didn't see an easy way to request smaller system disks versus the rest of the data drives for any given Z-series unit. It's a very good first step. If a Linux vendor adopts an appliance based on OpenSolaris (albeit a Debian/Ubuntu lookalike), you know there is something cooking.


Monday, June 16, 2008

Closet WO Developer

One of the many hats I wear is that of an erstwhile Java developer. Our internal apps have been heavily reliant on object-relational mappers. I've dabbled in RoR, and even helped get a fully open source TurboGears project off the ground here. However, the primary solution we've used in production since 2002 has been the granddaddy of ORM solutions: WebObjects.

This past week's Apple WWDC was a great one for WebObjects. The usual NDA applies. However, prior to that, the WebObjects community had its own two-day in-depth conference in San Francisco. No NDA for that, and I can report that WO development is alive, well, and dare I say thriving. The news about SproutCore has a second story, in that the backend of choice may be RoR, but the #1 apps will likely also be WO-based. Got an iPhone? Learn WO. As we open-source some of our projects here, I'll write a few more posts and speak to some more points, but with the latest release of WebObjects (5.4.x) the final deployment restrictions on the free WO frameworks were lifted. I expect some level of renewed interest.

To find out more, check in with the WOCommunity.

Wednesday, June 04, 2008

Recommended Disk Controllers for ZFS

Since I've been using OpenSolaris and ZFS (via NexentaStor, plug plug) extensively, I get a lot of emails asking what hardware works best. There have been various postings on the opensolaris and zfs lists to the same effect. A lot of people reference the OpenSolaris HCL lists, which leave the average user scratching their head with more questions than answers. More to the point, the HCL doesn't tend to answer the direct question of what hardware to get to build a ZFS box, NAS, etc. It's important to note that in the case of ZFS, all that extra checksum, fault management, and performance goodness can be negated by selecting a "supported" hardware RAID card. Worse yet, many RAID cards are not fully interchangeable on the spot. So what do you want for ZFS?

First, pick any 64-bit, dual-core-or-better processor and motherboard. If you can get ICH6+, Nvidia, or Si3124-based onboard SATA, then you are in good shape for a basic ZFS box, using the onboard SATA for your system disks alone. System disks can be lowly 5400RPM 2.5-inch SATA-I drives. Many people then want a large-memory, battery-backed RAID card, but my tests with the high-end LSI SAS cards show that memory on the RAID card doesn't do you as much good as a recipe of lots of system RAM, a sufficient number of cores, many disk drives for spindles, and good use of the PCI-X/PCIe bus with JBOD-only disk controllers. I'll cover the controllers next, but at this point I'd recommend 4GB of RAM minimum, dual cores at greater than 2GHz, and, for any serious load, at least two PCI-X or multi-lane PCIe cards.

Disk controllers are where the real questions are asked. Over multiple iterations, heavy use, and some anecdotal evidence, we are down to some sweet spots. For PCI-X, there is one game in town, the Marvell-based AOC-SATA2-MV8, used in the X4500. At $100 for 8 JBOD SATA-II ports, it just works and is fault managed. Stick just SATA-II disks on these, and keep any SATA-I disks on the motherboard SATA ports for system disks. I'll add that various Si3124-based cards exist here, but not with sufficient port density.

SuperMicro AOC-SATA2-MV8 link

When it comes to PCIe, there aren't any good high-port-count options for SATA. If you need just 2 ports, or eSATA, there are various solutions based on the Si3124 chipset, and SIIG makes many of them for $50 each. However, in the PCIe world, the real answer is SAS HBAs that connect to internal or external mixed SAS/SATA disk chassis. Again, most SAS HBAs are either full-fledged RAID cards without JBOD support, or simply don't work in the OpenSolaris ecosystem. 3ware is a lost cause here. The true winner for both cost and performance, while providing the JBOD you want, is the LSI SAS3442E-R.

CDW catalog link for LSI 3442ER
LSI 3442ER product page

It's $250, but I've seen it as low as $130. It has 8 channels, with 2 internal ports (generally 8 drives are connected to a single SAS port) as well as an external port. You can use this with an external SAS-backed array of SATA drives from Promise, for instance, to easily populate 16 or 32 drives internally, with an additional 48 drives externally, all from the one card. Would I suggest that many on a single card? No, but you can. Loading up your system with 2 or 4 of these cards, which are based on the LSI 1068 chipset that Sun supports well, is the best way forward for scale-out performance. I was given numbers of 200MB/sec writes and 400MB/sec reads on an example 12-drive system using RAIDZ. Good numbers, as I got 600MB/sec reads on a 48-drive X4500 Thumper.

If you have PCI-X, go Marvell. PCIe? Go LSI, but stick to the JBOD-capable, not-so-RAID HBAs. Don't just trust me: throw $100 or two at these and try it yourself. You'll see a better return than spending $800 on the larger RAID cards. I went the latter route and have paid dearly (Adaptec, LSI, you name it). What worked from the beginning and is working today are the Marvell cards, and I've been playing with new systems that use the LSI 3442ER.

Saturday, May 31, 2008

Mixing SATA dos and don'ts

Another day, another bug seemingly hit. I've known for some time that, under OpenSolaris, mixing SATA-I and SATA-II devices on the same controller seems to be unwise. I've already had systems where the initial ZFS boot drive was a small-capacity, and thus likely SATA-I, drive while the data volumes were SATA-II. My recent issues with the iRAM could be related to having a SATA-I device after a SATA-II drive in the chain, but nothing has been concrete.

However, today I discovered something else. One array I have is made up of all SATA-I drives and was attached to a SATA-I RAID card that went south. I happily replaced the card with the Marvell SATA-II JBOD card, and it was working just fine. I then lost the 6th of 7 drives and went back to the manufacturer to try to buy a replacement. Sadly, these RAID Edition drives have been "updated" to be SATA-II at a minimum for the same model. Replacing the failed SATA-I drive with the SATA-II one worked, but on subsequent reboots the 7th drive tended not to be enumerated by the Marvell card at startup, and even after re-inserting it, a "cfgadm" was necessary to activate it. Even then, a "zpool import" or "format" to introspect the now-configured drive would wedge and never complete. Weird, right?
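
For the record, bringing the missed drive back online looked roughly like this (the attachment point name is illustrative and varies by controller and slot):

cfgadm -al                      # list attachment points; the missing disk shows up unconfigured
cfgadm -c configure sata1/6     # configure the re-inserted drive so the OS will present it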

The solution that returned stability was to swap the 6th and 7th drives, so that the SATA-II disk came after all the SATA-I devices in the chain. I'm not sure why it works, but every reboot now works: it never fails to enumerate that last drive, and there is no need to manually cfgadm-configure the drive post-boot. A set of truisms is therefore starting to come together for mixed SATA drives. Whether Marvell, Sil3124, or the like, it's never a good idea to mix SATA-I and SATA-II devices on a single controller, but if you must, make sure that the SATA-II drives come after the SATA-I drives. The best configuration is to restrict SATA-I boot devices, such as small 5400RPM "laptop" drives, to their own onboard SATA interface, and leave all SATA-II devices to add-on boards.

Tuesday, May 27, 2008

The problem with slogs (How I lost everything!)...

A while back, I spoke of the virtues of using a slog device with ZFS. The system I went into production with had an Nvidia-based SATA controller onboard and a Gigabyte i-RAM card. No problems there, but at the time it used the cmdk driver (PATA mode) for my OpenSolaris-based NexentaStor NAS. After a while, I got an error where the i-RAM "reset" and the log went degraded. The system simply started to use the data disks for the intent log. So, no harm done. It's important to note that the kernel was a B70 OpenSolaris build.

Later, I wanted to upgrade to NexentaStor 1.0, which had B85. Post-upgrade, or even using a boot CD, it would never come up with the i-RAM attached. The newer kernel used the nv_sata driver, and since I could always get things to work in B70, I reverted to that. This is one nice feature that Nexenta has had for quite some time: the whole OS is checkpointed using ZFS to allow reversions if an upgrade doesn't take. Well, the NAS doesn't like having a degraded volume, so I've been trying to "fix" the log device. Currently, in ZFS, log devices cannot be removed, only replaced. So I tried to replace it using the "zpool" command. Replacing the failed log with itself always fails, as it's "currently in use by another zpool". I figured out a way around that: fully clearing the log using something like "dd if=/dev/zero of=/dev/mylogdevice bs=64k". I was able to upgrade my system to B85, and then I attempted to replace the drive again, and it looked like it was working:


  pool: data
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 0.00% done, 450151253h54m to go
config:

        NAME           STATE     READ WRITE CKSUM
        data           DEGRADED     0     0     0
          raidz1       ONLINE       0     0     0
            c7t0d0     ONLINE       0     0     0
            c7t1d0     ONLINE       0     0     0
            c6t1d0     ONLINE       0     0     0
        logs           DEGRADED     0     0     0
          replacing    DEGRADED     0     0     0
            c5d0       UNAVAIL      0     0     0  cannot open
            c8t1d0     ONLINE       0     0     0

errors: No known data errors


Note well that it is replacing one log device with another (using the new nv_sata naming). However, after it reached 1% it would always restart the resilver, with no ZFS activity, no snapshots, etc. The system was busy resilvering and resetting, getting nowhere. I decided to reboot to B70, and as soon as that came up it started to resilver immediately, and after quite a long time for a 2GB drive, the resilver completed. So, everything was now fine, right?

This is where things really went wrong. At the end of the resilver, it still considered the volume degraded, and the status looked like the above output but with only one log device. Rebooting the system, the volume started spewing out ZFS errors, and checksum counters went flying. My pool went offline. Another reboot, this time with the log device disconnected (since nv_sata didn't want it connected for booting), caused immediate kernel panics. What the hell was going on? Using the boot CD, I tried to import the volume. It told me that the volume had insufficient devices. A log device shouldn't be necessary for operation, as it hadn't been needed before. I attached the log device and ran cfgadm to configure it, which worked and got around the boot-time nv_sata/i-RAM issue. Now it told me that I had sufficient devices, but what happened next was worse. The output showed that my volume consisted of one RAIDZ, an empty log device definition, and my i-RAM added to the array as an additional degraded drive, as a stripe! No ZFS command was run here. It was simply the state of the system relative to what the previous resilver had accomplished.

Any attempt to import the volume fails with a ZFS error regarding its inability to "iterate all the filesystems" or something to that effect. I was able to mount various ZFS volumes read-only using "zfs mount -o ro data/proj" or similar. I then brought up my network and had to manually transfer the files off to recover, but this pool is now dead to the world.

What lessons have I learned? Slog devices in ZFS, though a great feature, should not be used in production until they can be evacuated. There may be errors in the actions I took above, but the bugs I see include the inability of the nv_sata driver to deal with the i-RAM device for some odd reason, at least in B82 and B85 (as far as I've tested). The other bug is that a log replace appears either to not resilver at all (B85) or, when resilvering in older releases, to not correctly resilver the log but instead shim the slog in as a data stripe. I simply can't see how that is, by any stretch of the imagination, by design.

Saturday, May 03, 2008

ZFS: Is the ZIL always safe?

One of my ZFS-based appliances, used for long-term backup, was upgraded from B70 to B85 of OpenSolaris two weeks ago. This time around, I re-installed the system to get RAIDZ2, and certain "hacks" that I had been using were no longer in place. The old settings were in /etc/system, and are the well-known zil_disable and zfs_nocacheflush enablers. They were left there from when the system temporarily acted as a primary server for a short time with its Adaptec (aac) SATA RAID card and accompanying SATA-I drives. Since the unit was UPS-attached, it was relatively safe for NFS client access, and later on there was no direct client access over NFS. No harm done, and stable for quite some time over multiple upgrades from B36 or so, over a year without an error.

A curious thing happened as soon as I upgraded without these somewhat unsafe kernel settings. I started to get tons of errors, and twice my pool went completely offline until I cleared and scrubbed it. An example of the errors:

        NAME        STATE     READ WRITE CKSUM
        tier2       DEGRADED     0     0     0
          raidz2    DEGRADED     0     0     0
            c1t1d0  FAULTED      0    64     0  too many errors
            c1t2d0  DEGRADED     0    46     0  too many errors
            c1t3d0  DEGRADED     0    32     0  too many errors
            c1t4d0  DEGRADED     0    47     0  too many errors
            c1t5d0  DEGRADED     0    39     0  too many errors
            c1t6d0  FAULTED      0   118     0  too many errors
            c1t7d0  DEGRADED     0    57     0  too many errors


Nothing explained the turnaround from stable to useless for any writes. I also got some read errors, and no nightly rsync against this tree would survive without incrementing some error count. Was it somehow one of my cache settings on the adaptec card that conflicted with a new version of the "aac" driver? I thought I would need to isolate it, revert perhaps, or consider that somehow my card was simply dying. Perhaps the cache/RAM on the card itself was toast.

A recent post on the opensolaris-discuss mailing list gave me an idea. Mike DeMarco suggested that a user suffering from repeated crashes, which corrupted ZFS until cleared, try zil_disable to test "if zfs write cache of many small files on large FS is causing the problems." That makes some sense if the card is somehow thrashing on small writes. Using this box for backup means it is being read and written via rsync, which can involve many small updates, and I also had various read errors pop up. So, after another degraded pool, I put old faithful zil_disable back (and, for good measure, zfs_nocacheflush), and after a reboot and scrub let it do its nightly multi-terabyte delta rsyncs. After a few days, there are no errors. Have I stumbled onto some code path bug that is ameliorated by these kernel options? Do newer kernels have suspect aac drivers?
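
For completeness, on these older OpenSolaris builds the same ZIL switch could also be flipped at runtime with mdb rather than waiting for a reboot with /etc/system; the commonly cited (and equally unsafe) invocation is:

echo zil_disable/W0t1 | mdb -kw     # takes effect for filesystems mounted afterwards, so remount or re-share them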

Perhaps someone will prove the logic of the above all wrong, but for now, I'm returning to the old standby "unsafe" kernel options to keep my pool stable.

Thursday, April 03, 2008

OpenDirectory upgrade path from 10.4 to 10.5

In EE we've migrated from various AD and OpenLDAP installations to what we hope is a more manageable long-term solution. Sadly, upgrading OpenDirectory (MacOSX's OpenLDAP-based directory services) from 10.4 to 10.5 doesn't work as Apple states it would. Here's the complete recipe we used to keep our data, our passwords, and, most importantly, our domain SID. Apple tends not to care about maintaining the SID in its various replica-to-master promotion steps.

First, a reference to the cookbook for doing things the hard way.

As recommended in the above and from other postings, upgrades do not work. Rather, what needs to be done is this:

10.4 Server:

1) Go to Server Admin, OpenDirectory, and under the Archive tab generate an archive of the OpenDirectory DB. Place it in the admin home directory
2) For safekeeping, go to /var/db/samba and get the secrets.tdb file. Place it in the admin home directory (readable by all)
3) Get the current SID by running, as root/sudo, "net getdomainsid EE", where EE is the domain we are supporting. Place the output in the home directory
4) Copy the above three files/directories off to a 3rd-party machine
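
Condensed into commands, the 10.4-side collection looks something like this (the archive image name and the destination host are placeholders; EE is our domain):

sudo net getdomainsid EE > ~/ee-sid.txt                 # step 3: record the domain SID
sudo cp /var/db/samba/secrets.tdb ~/ && sudo chmod a+r ~/secrets.tdb
scp ~/od-archive.sparseimage ~/secrets.tdb ~/ee-sid.txt laptop:od-migration/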

10.5 Server:

1) Install fresh, and use the exact same IP and name as the 10.4 Server. You'll likely need to have these on their own net. Also note that without a link on the primary interface, smb, dns, and opendirectory don't work. I suggest connecting to the third-party machine listed above; in my case, my laptop's physical connection, which I assigned to the private net
2) You'll need DNS set up temporarily, so create a DNS server for your domain (stanford.edu) and create a host entry for yourself. Point the local network settings at the server itself as the DNS server
3) copy over the files saved from 10.4 from the laptop/3rd party machine
4) Make an OpenDirectory Master, using the correct domain "dc=ee,dc=stanford,dc=edu" and correct KRB realm "EE.STANFORD.EDU"
5) import the archive of 10.4
6) run as root "mkpassdb -kerberize"
7) Create a new PDC config for Windows. Use the directoryadmin account/password to give samba correct access to the OpenDirectory DB
8) edit /var/db/smb.conf to fit the /etc/smb.conf entries you had on 10.4. Likely you'll want to set "local path = " and add "admin users = directoryadmin, domainjoin, @admin" or the like, where the first is the directory admin account and the second is a PDC join account that can't log in but has directory admin rights. @admin works to include anyone in the admin group
9) run as root "chflags uchg /var/db/smb.conf" to freeze your samba config. Recommend making a copy as well in the same dir.
10) run as root "net setdomainsid (SID)" where SID is the one you saved from 10.4
11) Go into Workgroup Manager. Change preferences to enable Inspector. Go into Inspector and select "Config" and then "CIFSServer". The two Value lines with "xml version.." need to have Edit run against them, and replace the SID line in each with the SID you just used.
12) restart Samba/Windows services. Check the SID with, as root, "net getdomainsid" and "net getlocalsid EE" or the like. If anything didn't stick, do steps 10 and 11 again.
13) before going live, you need to remove the reference to the local DNS in Network preferences, and optionally disable the DNS service. This setup was also only tested with the WINS service enabled as the WINS Server
14) test, test, test from Windows, including domain logins, enumeration of groups in Windows for adding domain users, etc. Logs may show if accounts are failing.

On Windows, the simple tests you can do involve the utility "nltest", which is in the free Support Tools (but may not be installed by default). "nltest /?" lists the commands, although OS X Samba only supports some of them.

..to list PDC and BDCs --- nltest /dclist:your_domain

nltest /dclist:ee
Domain 'ee' is pre Windows 2000 domain. (Using NetServerEnum).
List of DCs in Domain ee
\\EE-OD (PDC)
The command completed successfully

..to verify schannel --- nltest /sc_query:your_domain
C:\>nltest /sc_query:ee
Flags: 0
Trusted DC Name \\EE-OD
Trusted DC Connection Status Status = 0 0x0 NERR_Success
The command completed successfully

To do a more detailed check, you can open the Windows management console and try to look at the members of the Administrator group for the machine. When we had trouble, it just showed raw SID numbers, even for EE\DomAdmins. Once things were fixed, that showed correctly.

Error cheat sheet:

1. If smb logs show that directoryadmin or domainjoin and the like have the "wrong sid" in passdb, you'll need to demote the Windows services to a workgroup and promote back to PDC. You'll need to run "chflags nouchg /var/db/smb.conf" first, and copy back your saved version after re-promotion, as the file will be rewritten. Do steps 9-12 above again

2. If Kerberos isn't effectively working on clients, you may need to reimport the OpenDirectory archive, rerun "mkpassdb -kerberize", and follow the above demote/promote steps.

Have NAS, Want Shell

Now that anyone can grab Nexenta's NAS product, many will undoubtedly want to get under the hood, especially developers. First, a fair warning that although the management infrastructure is resilient to many changes done manually, modifying various service configurations outside of Nexenta's internal version control can lead to one or two headaches if you aren't careful. That said, give me a shell!

Well, that's simple. When you log in via the console (ssh, for example), simply run "setup appliance nmc edit-settings". You can tab your way through that command as well. Once there, go and edit expert_mode to be "1". Yes, you've entered the "vi" command zone, so save and exit with ':wq'

Once the changes are saved, you'll be asked to refresh the console settings, and now you can type "!bash" to get a nice usable shell, or bang-escape any command. You'll be root, so be aware and behave! Now you know what Nexenta Core was all about, as it's all there at your fingertips, along with the NMS, NMC, and NMV subsystems that are the foundation of the NAS product.

update:
I was told that an alternative way to set expert mode is
option expert_mode = 1 -s
as denoted in the "option -h" documentation for NMC. The "-s" flag updates the on-disk configuration.

Developers, developers, developers...

Ever wanted that NAS on your own hardware, for free? Nexenta has finally released their NexentaStor Developer Edition 1.0, which is a free version of their commercial product with only a 1TB limit on used storage. All functionality otherwise is there, unlimited. This is a near-final release of the commercial version, but it is the first version the general public can get and install on their own hardware.

The release represents many things, but the Developer releases are focused on more than just tire kicking or a free NAS product for your home NAS needs. Rather, there is a lot of potential to extend and use Nexenta's SA-API for storage service-enabled solutions. Wish to modify your DB to wrap a transaction in a snapshot? Need to automate separate file system creation, quotas, etc for your users? Registered users on the web site can look at an overview of the architecture and sample SA-API components. I expect much more in the way of API details in the near future. However, the release of the product is here today.

A general support forum is also available

There are two other automation aspects to NexentaStor that I haven't given much love to here. Both utilize the batch nature of NMC, the Nexenta Management Console. One is the 'query' functionality, which allows various introspections on the NAS and can query across multiple appliances at once if they are grouped together (the group function). In a similar vein, there is the NMC recording facility, which is handled by the "record" command. Recording allows you to save and play back actions for various tasks, including over a network of NAS devices. All of these commands have ready examples available by invoking the command with the "-h" help argument in the console. There is also good stuff in the User Guide which is available for download.

Friday, March 21, 2008

Step by Step CIFS Server setup with OpenSolaris

After CIFS Server was released into the OpenSolaris wild, I could not for the life of me get it to work. Even in the post-B82 stage, the random collection of documentation led me astray in multiple ways. I think part of the problem is that I read up on it too much, and thus old requirements that were no longer accurate got in the way. You need to set up your krb5.conf file, right? LDAP too? The final resolution appears to be rather straightforward, and it likely shows that other steps I had taken previously were left rotting on my system and prevented a working solution.

So, what do you actually need? I'd recommend starting with at least B85. In my case I used the latest NexentaOS unstable release (1.0.1 to be) which includes B85 and by default the necessary Sun smb packages. For my test, I created a contrived domain using Windows 2003 Server (SP2) called WIN.NEXENTA.ORG. The rest follows:

add to /etc/resolv.conf:
nameserver 172.24.101.71
domain win.nexenta.org
search win.nexenta.org
(Nameserver is our AD DNS server)

(optional: run ntpdate against your time server)
#svcadm enable svc:/network/ntp:default
#svcadm enable -r smb/server
#smbadm join -u Administrator win.nexenta.org

#zfs set sharesmb=on data/myshare
#zfs set sharesmb=name=myshare data/myshare

#mkdir /data/myshare/jlittle
#chown jlittle /data/myshare/jlittle

#idmap add 'winuser:*' 'unixuser:*'
#idmap add "wingroup:Domain Users' 'unixgroup:staff'

#svcadm restart smb/server
#svcadm restart idmap

Other advisable steps include "zfs set casesensitivity=mixed data/share" for correctness for Windows users, though that is likely not ideal if the ZFS filesystem being shared is also shared to NFS clients. You know it's all working if "idmap dump" gives you real values and doesn't just return to the prompt. I connected to my new share via a MacOSX client, and made sure my domain matched as "win.nexenta.org" when connecting to my share (aka smb://server/myshare/jlittle).

In the end, it was much simpler than the documents suggested. I had to avoid explicitly stating the domains in idmap to make idmap do the right thing. You should pick the right local group for your users in the mapping for groups. I picked "staff" as that was the default group of my user.

Monday, March 03, 2008

Random Storage Comments, Answered

In my last posting, a lot of comments covered wide and varied ground. First, it's important to note that even with CDP underlying ZFS pools, ZFS itself provides for its own integrity of state. If CDP didn't complete a transaction, a re-sync will generally resolve it, but the actual hosted ZFS filesystem need not fear: its transactions won't be finished until the write is checksummed. I agree that there are failure modes here, but that leads to a good quote from one of the comments:

"To that end, it seems to be that whenever a choice can be had between doing something simple to accomplish a goal and chaining a bunch of parts together to accomplish the same goal with more sophistication, its likely the simpler solution will be more sustainable over time."

I concur. Nexenta marries two pieces of functionality to get auto-cdp, and they rely on the two components in whole to maintain overall simplicity of implementation. The real value that they have provided is in making the front end dead simple. If the management isn't simplified, any level of underlying functionality will be lost in the long run.

I want to focus more on the simplicity of performant NAS solutions. With pNFS, Lustre, and the like, we know that the client becomes a bit less transparent, and the backend store of data definitely becomes somewhat opaque, as data is no longer consistent per one server but is spread out across the whole back end. Even though you need newer clients with specific functionality in both cases, it can again be simpler than the alternative, which generally involves NFS v3 clients using automounts, LDAP-based automount maps, and heavy-handed data management on the backend to scale out in similar ways over multiple NAS heads. The tack of taking a single high-end head with best-of-breed backend hardware, such as IB interconnects to SAS disk arrays and 10Gb Ethernet out the front, might seem to work, but we have already seen pathological conditions where a single heavy client writing millions of small files can make that enhanced hardware meaningless for performance.


There is no quick answer to scaling out both capacity and performance without a little give on each aspect of the design. What makes it all reasonable to consider is whether the entire solution is made much simpler to manage than the alternatives at either end of the design spectrum. In the end, simplicity of manageability will trump other considerations. As long as simplicity is strictly maintained in the product, the underlying complexity will seem well worth the effort. We just need to trust that someone gets the fine details right. We don't mind that we can't muck much with a highly efficient, high-performance car: as long as we feel mastery over its operation and trust the quality of the build and the manufacturer's service, we are all willing to make the investment.

Thursday, February 07, 2008

The ZFS scaling and DR question

In my dealings with ZFS-based NAS and second-tier solutions, I've been blessed to hear from different people with thoughts that push the discussion forward. The ZFS space is where a lot of long-term strategies built on commodity components are being worked out. Other spaces that I follow seem somewhat stale, or consider specialized point systems that solve problems between just two or three chosen vendors. ZFS is open, so I think its ecosystem can only grow. I'm happy to have permission from Wayne Wilson to restate one of his emails and use it as a discussion point.




I see two different paths to scale:

1) Use a single system to run NexentaStor and have it mount
the remaining systems via iSCSI.  This would probably have to
be done over 10 Gbs or infiniband links.

2) Create Multiple NexentaStor systems and only present them
as a unified file system via NFS V4.  This would leave the
CIFS clients restricted to mounting multiple volumes, but that
may be ok.


Is this it or is there some other way?

Next architectural issue is how to do DR and archiving.

There are two types of DR - Human initiated and machine initiated.

My standard strategy for the Human initiated DR is to use
snapshots and keep enough of them around to answer most restore
requests. For machine initiated, my worst case is when the
storage subsystem (either a complete Thumper or a complete array)
fails.  For this I can find no other solution than to replicate.

  As you have pointed out, replication usually locks you
into the backend storage vendors system, whereas it would be
better for us consumers to be able to 'mirror' across two
disparate storage backends.

Here is where things might fall apart in using Thumpers. We
could probably spec out a really high I/O throughput 'head end'
type server to load Nexenta on.  Then we could present any kind
of storage as iSCSI LUN's to the system. Let's say we use a
Thumper 48TB system to as an iSCSI target for our head end, then
we could use an Infortrends (or some such) SATA/iSCSI array for
other LUN's and let ZFS mirror across them.......and rely on
snapshots for human based DR and mirror failure protection for
machine based DR.


Then that leaves us with Archiving. I think that here is where
a time based tier, or at least the ability to define a tier based
on age would be useful.  If we set the age to a point beyond which
most changes are taking place (letting snapshot's take the brunt
of the change load before archiving), then it is likely that we
will have just one copy of the data to archive off, for most files.

What we would want to do is make tape the archival tier. I am
uncertain as to how to do this.  Should it be done using vendor
backup software to allow catalogs and GUI retrievals?



Wayne covers how to scale out ZFS-based NAS heads, as well as the standard DR and archiving questions.

Considering the NAS head scenario, it is plain that in the long term, running storage across multiple NAS heads using upcoming technologies as they mature, such as Lustre or pNFS, will be necessary to scale out both performance and capacity. However, it is reasonable to consider solutions that use DAS, InfiniBand, external SAS, or iSCSI to take one head node and approach petabyte levels. I consider this within reason if the target is second-tier or digital archive storage, where performance isn't king. With the best of hardware, perhaps a single head node (or an HA configuration) will have sufficient performance for most primary storage deployments; time will tell, as our needs grow into such solutions and the technologies we have employed improve.
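
As a rough illustration of the single-head approach Wayne describes, the head node can aggregate LUNs from two disparate backends into one mirrored pool. This is a minimal sketch using the OpenSolaris iSCSI initiator; the addresses and device names are hypothetical:

  # On the head node: point the iSCSI initiator at both backends
  iscsiadm add discovery-address 10.0.0.10:3260     # Thumper exporting LUNs
  iscsiadm add discovery-address 10.0.0.20:3260     # SATA/iSCSI array
  iscsiadm modify discovery --sendtargets enable

  # Once the LUNs show up as local disks, mirror each pair across the
  # two backends so ZFS provides redundancy between disparate vendors
  zpool create tank mirror c2t0d0 c3t0d0
  zpool status tank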

Disaster recovery I brought up in a recent post. File-based backup solutions work well for the act of backing up, but restoration at large scale is wanting, especially when file counts matter more than file sizes. Happily, the next beta of NexentaStor has taken a large leap forward in addressing this by implementing a very simple-to-manage auto-cdp (continuous data protection) service across multiple NAS heads. It keeps multiple storage systems in sync as data is committed, operates below the ZFS layer, and is bidirectional. Yes, the secondary system can import the same LUNs or ZFS pools and re-export them to your clients. Just as important, if you lose the primary host, synchronization can be reversed to restore your primary system at full block-level speed.
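
To be clear, auto-cdp itself is managed through the appliance and works below ZFS, so the following is not that service; it is only the rough snapshot-based equivalent you could run by hand between two plain ZFS hosts (pool, dataset, and host names are hypothetical):

  # Initial full copy to the secondary
  zfs snapshot tank/data@rep-1
  zfs send tank/data@rep-1 | ssh nas2 zfs receive -F tank/data

  # From then on, ship only the deltas between snapshots
  zfs snapshot tank/data@rep-2
  zfs send -i tank/data@rep-1 tank/data@rep-2 | ssh nas2 zfs receive tank/data

  # After a failover, the same pipeline can be run in the opposite
  # direction to bring the rebuilt primary back up to date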

If you take this approach, and also consider exposing native zvols or NFS/CIFS shares to your clients (such as a mail server), those clients can keep their local DAS storage under any OS and filesystem, yet use their native backup tools to write block-level dumps to the ZFS-exported volumes regularly, allowing speedy block-level restores. Mixing this with file-level backups even permits less frequent full dumps and greater granularity in recovery. In the end, you'd hope to have these wonderful features directly on your server OS so DR is never needed, but you can see that we are approaching reasonable answers.
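
A minimal sketch of that idea, assuming an NFS-exported dump area on the NAS and a Linux mail server using the classic dump utility (dataset names, mount points, and devices are hypothetical):

  # On the NAS: a compressed, NFS-shared landing area for client dumps
  zfs create tank/dumps
  zfs set compression=on tank/dumps
  zfs set sharenfs=on tank/dumps

  # On the mail server, with nas1:/tank/dumps mounted at /mnt/dumps:
  #   dump -0u -f /mnt/dumps/mail-full.dump /dev/sda1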

The final issue brought up is archival, and I hope my previous posts have gone far in answering it. In general, I believe disk-based archival solutions need to be employed before tape is considered, and tape should be relegated to the final archival stage only. Today, you can use multiple open (Amanda/Bacula) and closed backup software solutions to write to tape libraries from trailing-edge snapshots. I also know that, though in its infancy, the NDMP client service evolving for ZFS will someday allow easier integration into current backup systems, letting most people convert existing tape-based solutions completely into their last-tier archive, running infrequently for long periods with just full backups.
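
In the simplest case, a trailing-edge snapshot can be written straight to tape with tar; Amanda or Bacula would wrap the same idea in catalogs and scheduling. Dataset, snapshot, and tape device names below are hypothetical:

  zfs snapshot tank/archive@monthly-2008-02
  # Every snapshot is reachable read-only under the .zfs/snapshot directory
  tar -cf /dev/rmt/0 /tank/archive/.zfs/snapshot/monthly-2008-02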

All the above is just my "it's getting better" perspective. Perhaps you can find some glaring weakness. I hope shortly you can all see the auto-cdp service that Nexenta has put together in action; it's well worth the wait.

Thursday, January 17, 2008

Using the iRam: Improving ZFS perceived transaction latency

I've been long overdue in reviewing the Gigabyte iRam card and its effect on the performance of your favorite ZFS NAS product. NexentaStor already supports log devices, so the time appeared right to get one of these for a client I consult with, to help deal with the noticeable pauses one can see when heavy reads and writes compete on a ZFS pool. I hope the single-threaded nature of those commits is resolved at some future point, but the iRam card appears to be a simple way to inject an NVRAM-like device into your commodity NAS solution.

The card itself is simply four DIMM sockets for DDR RAM, with a battery backup, a reset switch, power drawn from a PCI bus, and a single SATA-I connection to plug the unit into your existing SATA interfaces. Already you can see that the performance limit is 150MB/sec, per the SATA-I spec. What does this card do, though? Near-instant reads and writes in a safe, battery-backed ramdisk that your system sees as a 2GB or 4GB drive: just what you'd want for a dedicated write commit device. With many spindles in an array you can likely beat this device for raw throughput, but with many small commits, the near-perfect latency of RAM is far better suited to keeping writes flowing without stalling the drives for reads. Since it's a "slog" device in ZFS terms, it will regularly commit to the real underlying storage at full disk bandwidth. Therefore, even when writes must compete with reads on the physical disks, you limit your exposure to perceived stalls in I/O requests, even under higher load.
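
Hooking such a device into an existing pool is a one-liner; the pool and device names below are hypothetical, and note (as mentioned later) that current builds will not let you remove the log device again once it is added:

  zpool add tank log c3t0d0
  zpool status tank    # the iRam now shows up under a separate "logs" vdev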

For my non-production test, I actually put together the worst-case scenario: an iSCSI-backed ZFS array with NFS clients and many small files. In this case, every NFS file write requires three fsyncs on the backend storage, as required by NFS (create, modify, close). This is actually similar to CAD libraries, which the test was made to reflect. Using iSCSI devices, you can inflict much higher latencies: my iSCSI targets are actually older SATA-I drives themselves, on an SBEi Linux-based target using 3ware 8500s. Again, nowhere near ideal.

Creating a directory of 5000 small 8k files, I copied it from a gig-E-connected Linux client to a ZFS pool (made of two non-striped iSCSI LUNs) and got a meager 200K/sec write performance over NFS. Striping the data across the ZFS pool instead raised the numbers to 600K/sec at some points. Adding a 2GB Gigabyte iRam drive pushed those numbers up to 9MB/sec, averaging around 5MB/sec overall. That's at least 10 times the performance. Again, this test stresses many I/O operations rather than raw bandwidth.
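
For those who want to reproduce the shape of this test, something like the following on the client side will do; the file count and size mirror the test above, while the paths are hypothetical:

  # Generate 5000 8k files, then copy them to the NFS-mounted ZFS pool
  mkdir smallfiles
  for i in $(seq 1 5000); do
      dd if=/dev/urandom of=smallfiles/f$i bs=8k count=1 2>/dev/null
  done
  time cp -r smallfiles /mnt/nfs/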

How fast can data be written to and read from that log device? My tests showed that 100MB/sec for reads and writes was common, with writes bursting to those numbers only for larger streaming data sets. The iSCSI nodes in question could each be pulled at a top rate of 45MB/sec, averaging closer to 27MB/sec, so nominally we are about 3x better than at least these gig-E iSCSI devices.
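
Per-device numbers like these are easy to watch live; zpool iostat breaks traffic out per vdev, including the log device (pool name hypothetical):

  zpool iostat -v tank 5    # refresh every 5 seconds; the log vdev is listed separately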

The final production installation of the iRam device was with a SATA-II DAS array, and even under heavier load we saw the wait cycles for write commits to the drives limited, with a steady 100+MB/sec use of the commit log (reads and writes). The only caveat with such a device is that current builds of OpenSolaris, and thus NexentaStor, do not allow you to remove it once it has been added to a pool. A future release is supposed to address that.
