Saturday, August 02, 2008

Amanda: simple ZFS backup or S3

When I first started researching ZFS, I found it somewhat troubling that no native backup solution existed. Of course there was the ZFS send/recv commands, but those didn't necessarily work well with existing backup technologies. At the same time, the venerable open source backup solution, amanda had found a way to move beyond its limitation of maximum tape size restricting backup run size. Over time, we have found ways to marry these two solutions.

In my multi-tier use of ZFS for backup, I always need an n-tier component that will allow for permanent archiving to tape every 6 months or year, as deemed fit for the data being backed up. These are full backups only, and due to the large amounts of data in the second tier pool, a backup to tape may span dozens of tapes and run multiple days. I found I had to tweak amanda's typical configuration to allow for very long estimate times, as the correct approach to backing up a ZFS filesystem today involves tar. Amanda's approach does a full tar estimate of a backup before a real backup is attempted. Otherwise, a sufficiently tape library is all you need and a working amanda client configuration on your ZFS-enabled system.

For those following along, I'm an avid user of NexentaStor for my second tier storage solution. Setup of an amanda client on that software appliance is actually quite easy.

setup network service amanda-client edit-settings
setup network service amanda-client conf-check
setup network service amanda-client enable

That's all that one needs to do. There is a sample line in the amanda configuration that you adjust in the first command above. The line I used is similar to this:

amandasrv.stanford.edu amanda amdump

You'll find that depending on your build of amanda server, that you'll either have the legacy user name of "amanda", the zmanda default of "amanda_backup", or the Redhat default of "backup" as the user things run as. I guess there had to be a user naming conflict at some point with "amanda".

The hardest part of the configuration is finding where you have your long term snapshots. Since a backup run can take days to weeks, you'll likely wish to backup volumes relative to a monthly snapshot. In your amanda /etc/amanda/CONFIDR/disklist configuration, a sample you may have for a ZFS-based client named nexenta-nas with volume tier2/dir* is:

nexenta-nas /volumes/tier2/dir1/.zfs/snapshot/snap-monthly-1-latest user-tar-span
nexenta-nas /volumes/tier2/dir2/.zfs/snapshot/snap-monthly-1-latest user-tar-span


Note well the use of user-tar-span in the two lines above. This allows for the backing up large volumes over multiple tapes in amanda. That one limitation of tape spanning in amanda was solved in a novel way. They break up backup streams into "chunksizes" of a set size to allow for a write failure at the end of one tape to begin fresh again at the beginning of that chunk on the following tape. This feature allows amanda to also be used to backup to Amazon's S3 service. Yes, instead of going to tape, you can configure a tape server to write to an S3 service. S3 limits writes to a maximum of 2GB a file, and amanda's virtual tape solution combined with that chunk sizing of backups works wonderfully to mate ZFS-based storage solutions to S3 for an n-tier solution. Please consult Zmanda's howto for configuring your server correctly. There really is nothing left to configure to get ZFS data to S3.

Labels: , ,

Sunday, July 27, 2008

Pogo Linux, Nexenta announce StorageDirector Z-Series storage

Pogo Linux Inc., a Seattle-based storage server manufacturer, and Nexenta Systems Inc., developer of NexentaStor, an open storage solution based upon the revolutionary file system ZFS, announced Wednesday immediate availability of a new set of storage appliances featuring NexentaStor.Yeah.. that was the posted text above. What does it really mean? More kit choices to get a open storage NAS. Some nice configuration options when ordering, but I didn't see an easy was to request smaller system disks versus the rest of the data drives for any given Z series unit. Its a very good first step. If a Linux vendor adopts an appliance based on OpenSolaris (albeit a Debian/Ubuntu-lookalike), you know there is something cooking.

read more | digg story

Monday, June 16, 2008

Closet WO Developer

One of many hats I wear is that of a erstwhile java developer. Our internal apps have been heavily reliant on object relational mappers. I've dabbled into RoR, and even helped get a TurboGears project off the ground here that was fully open source. However, the primary solution we've used in production since 2002 has been the grand daddy of ORM solutions: WebObjects.

This past week at Apple's WWDC was a great one for WebObjects. The usual NDA applies. However, prior to that the WebObjects community had their own two day in-depth conference in San Francisco. No NDA for that, and I can report that WO development is alive, well, and dare I say thriving? The news about SproutCore has a second story, in that the backend of choice may be RoR, but the #1 apps will likely also be WO-based. Got an iPhone? Learn WO. As we opensource some of our projects here, I'll write a few more posts and speak on some more points, but with the latest release of WebObjects (5.4.x) the final deployment restrictions on the free WO frameworks were lifted. I expect some level of renewed interest.

To find out more, check in with the WOCommunity.

Labels: ,

Wednesday, June 04, 2008

Recommended Disk Controllers for ZFS

Since I've been using OpenSolaris and ZFS (via NexentaStor, plug plug) extensively, I get a lot of emails asking about what hardware works best. There have been various postings on the opensolaris and zfs lists to the same effect. A lot of people reference the OpenSolaris HCL lists which leave the average user scratching their head with more questions than answers. More to the point, the HCL doesn't tend to answer the more direct question of what hardware should I get to build a ZFS box, NAS, etc. Its important to note that in the case of ZFS, all that extra checksum, fault management, and performance goodness can be negated by selecting a "supported" hardware RAID card. Worse yet, many RAID cards are not fully interchangeable on the spot. What do you want for ZFS?

First, pick any 64-bit dual core or better motherboard or processor. If you can get ICH6+, nvidia, or Si3124-based on board SATA, then you are in good shape for your basic ZFS box with on-board SATA for your system disks alone. System disk can tend to be low 5400RPM 2.5 inch SATA-I drives. Many people then desire some large memory, battery-backed RAID card, and my tests with the high end LSI SAS cards show that memory on the RAID card doesn't do you as much good as having a recipe of lots of system RAM, a sufficient number of cores, many disk drives for the spindles, and sufficient use of the PCIX/PCIe bus using JBOD only disk controllers. I'll cover the controllers next, but I'd recommend at this point 4GB of RAM minimum, dual core at greater than 2ghz, and for any good load, at least two PCI-X or multi-lane PCIe card.

Disk controllers are where the real questions are asked. Over multiples iterations, heavy use, and some anecdotal evidence, we are down to some sweet spots. For PCI-X, there is one game in town, the Marvell-based AOC-SATA2-MV8, used in the X4500. At $100 for 8 JBOD SATA-II ports, it just works and is fault managed. Stick just SATA-II disks on these, and keep any SATA-I disks on the motherboard SATA ports for system disks. I'll add that various Si3124 based cards exist here, but not for sufficient port density.

SuperMicro AOC-SATA2-MV8 link

When it comes to PCIe, there isn't any good high port count options for SATA. If you need just 2 ports, or eSATA, there are various solutions based on the Si3124 chipset, and SIIG makes many of them for $50 each. However, in the PCIe world, the real answer is SAS HBAs that connect to internal or external mixed SAS/SATA disk chassis. Again, most SAS HBAs are either full fledged RAID without JBOD support, or simply don't work in the OpenSolaris ecosystem. 3ware is a lot cause here. The true winner for both cost and performance, while providing the JBOD you want, is the LSI SAS3442E-R.

CDW catalog link for LSI 3442ER
LSI 3442ER product page

Its $250, but I've seen it as low as $130. 8 channels, with both 2 internal ports (generally 8 drives are connected to a single SAS port) as well as the external port. You can use this with an external SAS-backed array of SATA drives from Promise, for instance, to easily populate 16 or 32 drives internally, with an additional 48 drives externally, just from the one card. Would I suggest that many on that single card? No, but you can. Loading up your system with 2 or 4 of these cards, which are based on the LSI 1068 chipset that is well supported by Sun is the best way forward for scale out performance. I was given some numbers of 200MB/sec writes and 400MB/sec reads on an example 12-drive system using RAIDZ. Good numbers, as I got 600MB/sec reads on a 48-drive X4500 thumper.

If you have PCI-X, go Marvell. PCIe? Go LSI, but stick to the JBOD-capable not-so-RAID HBAs. Don't just trust me, throw a $100 or two at these and try it yourself. You'll see a better investment than $800 at the larger RAID cards. I went the latter route and have paid dearly (Adaptec, LSI, you name it). What worked from the beginning and is working today are the Marvell cards here, and I've been playing with new systems that use the LSI 3442ER.

Labels: , , , , , ,

Saturday, May 31, 2008

Mixing SATA dos and donts

Another day, another bug seemingly hit. I've known for some time that mixing SATA-I and SATA-II devices on the same controller with regards to OpenSolaris seems to be unwise. I've already had systems with the initial ZFS-boot drive being a small capacity and thus likely SATA-I, but the data volumes were SATA-II. My recent issues with the iRAM could be related to having a SATA-I device after a SATA-II drive in the chain, but nothing has been concrete.

However, today I discovered something else. One array I have is made up of all SATA-I drives and was used by a SATA-I RAID card that went south. I happily replaced it with the Marvell SATA-II JBOD card, and it was working just fine. I then lost the 6th of 7 drives, and went back to the manufacturer to try and buy a replacement. Sadly, these Raid Edition drives have been "updated" to be at a minimum of SATA-II for the same model. Replacing the failed SATA-I with the SATA-II worked, but on subsequent reboots, the 7th drive tended to not be enumerated by the Marvell card at startup, and even after re-inserting it, a "cfgadm" was necessary to activate it. Even then, a "zpool import" or "format" to introspect the now configured drive would wedge and never complete the command. Weird, right?

The solution to return to stability was to swap the 6th and 7th drive, so that the SATA-II disk came after all the SATA-I devices in the chain. I'm not sure why it works, but every reboot works now, it never fails to enumerate that last drive and there is no need to manually cfgadm configure the drive post boot. Therefore, a set of truisms are starting to come together with mixed SATA drives. Whether Marvell, Sil3124, or the like, its never a good idea to mix SATA-I and SATA-II devices on a single controller, but if necessary, make sure that the SATA-II drives come after the SATA-I drives. The best configuration is to restrict SATA-I boot devices, such as small 5400 "laptop" drives to their own onboard SATA interface, and leave all SATA-II devices to add-on boards.

Labels: , ,

Tuesday, May 27, 2008

The problem with slogs (How I lost everything!)...

A while back, I spoke of the virtues of using a slog device with ZFS. The system I went into production with had an Nvidia-based SATA controller onboard and a Gigabyte i-RAM card. No problems there, but at the time it was a cmdk driver (PATA mode) for my OpenSolaris-based NexentaStor NAS. After a while, I got an error where the i-RAM "reset" and the log went degraded. The system simple started to use the data disks for the intent log. So, no harm done. Its important to note that the kernel was a B70 OpenSolaris build.

Later, I wanted to upgrade to NexentaStor 1.0, which had B85. Post upgrade or using a boot CD, it would never come up with the i-RAM attached. The newer kernel was an nv_sata driver, and I could always get it to work in B70, so I reverted to that. This is one nice feature that Nexenta has had for quite some time, in that the whole OS is checkpointed using ZFS to allow reversions if an upgrade doesn't take. Well, the NAS doesn't like having a degraded volume, so I've been trying to "fix" the log device. Currently, in ZFS, log devices cannot be removed, but only replaced. So, I tried to replace it using the "zpool" command. Replacing the failed log with itself always fails as its "currently in use by another zpool". I figured out a way around that, and that was to fully clear the log using something like "dd if=/dev/zero of=/dev/mylogdevice bs=64k". I was able to upgrade my system to B85, and then I attempted to replace the drive again, and it looked like it was working:


pool: data
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress for 0h0m, 0.00% done, 450151253h54m to go
config:

NAME STATE READ WRITE CKSUM
data DEGRADED 0 0 0
raidz1 ONLINE 0 0 0
c7t0d0 ONLINE 0 0 0
c7t1d0 ONLINE 0 0 0
c6t1d0 ONLINE 0 0 0
logs DEGRADED 0 0 0
replacing DEGRADED 0 0 0
c5d0 UNAVAIL 0 0 0 cannot open
c8t1d0 ONLINE 0 0 0

errors: No known data errors


Note well, that it is replacing one log device with another (using the new nv_sata naming). However, after it reached 1% it would always restart the resilver with no ZFS activity, no snapshots, etc. The system was busy resilvering and resetting, getting no where. I decided to reboot to B70, and as soon as that came up, it started to resilver immediately and it proceeded after quite a long time for a 2GB drive to complete the resilver. So, everything was now fine, right?

This is where things really went wrong. At the end of the resilver, it still considered the volume degraded, and looked like the above output but with only one log device. Rebooting the system, the volume started spewing out ZFS errors, and checksums counters went flying. My pool went offline. Another reboot, this time with the log device disconnected due to nv_sata not wanting it connected for booting purposes causes immediate kernel panics. What the hell was going on? Using the boot cd, I tried to import the volume. It told me that the volume had insufficient devices. A log device shouldn't be necessary for operation, as it hadn't needed it before. I attached the log device and ran cfgadm to configure it, which works and gets around the boot time nv_sata/i-RAM issue. Now it told me that I have sufficient devices, but what happened next was worse. The output showed that my volume consisted of one RAIDZ, an empty log device definition, and additionally my i-RAM as an additional degraded drive added to the array as a stripe! No ZFS command was run here. It was simply the state of the system relative to what the previous resilver had accomplished.

Any attempt to import the volume fails with a ZFS error regarding its inability to "iterate all the filesystems" or something to that affect. I was able to mount various ZFS volumes read-only by using the "zfs mount -o ro data/proj" or similar. I then brought up my network and manually had to transfer the files off to recover, but this pool is now dead to the world.

What lessons have I learned? Slog devices in ZFS, though a great feature, should not be used in production until they can be evacuated. There may be errors in the actions I took above, but bugs that I see include the inability for the nv_sata driver to deal with the i-RAM device for some odd reason, at least in B82 and B85 (as I've so far tested). The other bug is that a log replace appears to either not resilver at all (B85) or, when resilvering in older releases, causes the system to not correctly resilver the log but instead to shim the slog in as a data stripe. I simply can't see how that is by any stretch of the imagination by design.

Labels: , ,

Saturday, May 03, 2008

ZFS: Is the ZIL always safe?

One of my ZFS-based appliances, used for long term backup, was upgraded from B70 to B85 of OpenSolaris two weeks ago. This time around, I re-installed the system to get RAIDZ2, and certain "hacks" that I've been using were no longer in place. The old settings were in /etc/system, and are the well known zil_disable and zfs_nocacheflush enabling. They were left there from when the system temporarily acted as a primary server for a short time with its Adaptec (aac) SATA RAID card and its accompanying SATA-I drives. Since the unit was UPS attached, it was relatively safe for NFS client access, and later on there was no direct client access over NFS. No harm done, and stable for quite some time over multiple upgrades from B36 or so, over a year without an error.

A curious thing happened as soon as I upgraded without these somewhat unsafe settings for the kernel. I started to get tons of errors and twice my pool as gone completely offline until I cleared and scrubbed it. An example of the errors:

NAME STATE READ WRITE CKSUM
tier2 DEGRADED 0 0 0
raidz2 DEGRADED 0 0 0
c1t1d0 FAULTED 0 64 0 too many errors
c1t2d0 DEGRADED 0 46 0 too many errors
c1t3d0 DEGRADED 0 32 0 too many errors
c1t4d0 DEGRADED 0 47 0 too many errors
c1t5d0 DEGRADED 0 39 0 too many errors
c1t6d0 FAULTED 0 118 0 too many errors
c1t7d0 DEGRADED 0 57 0 too many errors


Nothing explained the turnaround from stable to useless for any writes. I also got some read errors, and no nightly rsync against this tree would survive without incrementing some error count. Was it somehow one of my cache settings on the adaptec card that conflicted with a new version of the "aac" driver? I thought I would need to isolate it, revert perhaps, or consider that somehow my card was simply dying. Perhaps the cache/RAM on the card itself was toast.

A recent post on the opensolaris-discuss mailing lists gave me an idea. Mike DeMarco suggested to a user suffering from repeated crashes that corrupt ZFS until cleared to try and use zil_disable to test "if zfs write cache of many small files on large FS is causing the problems." Makes some sense if the card is somehow trashing on small writes. The use of it for backup means that its being read and written to via rsync and can involve many small updates. I also had various read errors pop up. So, I put the old faithful zil_disable and for good measure the zfs_nocacheflush back after another degraded pool, and after a reboot and scrub, let it do its nightly multi-terabyte delta rsyncs. After a few days, there are no errors. Have I stumbled onto some code path bug that is ameliorated by these kernel options? Do newer kernels have suspect aac drivers?

Perhaps someone will prove the logic of the above all wrong, but for now, I'm returning to the old standby "unsafe" kernel options to keep my pool stable.

Labels: , , ,