Saturday, August 02, 2008

Amanda: simple ZFS backup or S3

When I first started researching ZFS, I found it somewhat troubling that no native backup solution existed. Of course there were the ZFS send/recv commands, but those didn't necessarily work well with existing backup technologies. At the same time, the venerable open source backup solution, amanda, had found a way to move beyond its limitation of maximum tape size restricting backup run size. Over time, we have found ways to marry these two solutions.

In my multi-tier use of ZFS for backup, I always need an n-tier component that will allow for permanent archiving to tape every 6 months or year, as deemed fit for the data being backed up. These are full backups only, and due to the large amounts of data in the second tier pool, a backup to tape may span dozens of tapes and run multiple days. I found I had to tweak amanda's typical configuration to allow for very long estimate times, as the correct approach to backing up a ZFS filesystem today involves tar, and amanda does a full tar estimate of a backup before a real backup is attempted. Otherwise, a sufficiently large tape library and a working amanda client configuration on your ZFS-enabled system are all you need.

For those following along, I'm an avid user of NexentaStor for my second tier storage solution. Setup of an amanda client on that software appliance is actually quite easy.

setup network service amanda-client edit-settings
setup network service amanda-client conf-check
setup network service amanda-client enable

That's all that one needs to do. There is a sample line in the amanda configuration that you adjust in the first command above. The line I used is similar to this: amanda amdump

You'll find that, depending on your build of amanda server, you'll have either the legacy user name of "amanda", the zmanda default of "amanda_backup", or the Redhat default of "backup" as the user things run as. I guess there had to be a user naming conflict at some point with "amanda".

The hardest part of the configuration is finding where you have your long term snapshots. Since a backup run can take days to weeks, you'll likely wish to back up volumes relative to a monthly snapshot. In your amanda /etc/amanda/CONFDIR/disklist configuration, a sample you may have for a ZFS-based client named nexenta-nas with volume tier2/dir* is:

nexenta-nas /volumes/tier2/dir1/.zfs/snapshot/snap-monthly-1-latest user-tar-span
nexenta-nas /volumes/tier2/dir2/.zfs/snapshot/snap-monthly-1-latest user-tar-span

Note well the use of user-tar-span in the two lines above. This allows for backing up large volumes over multiple tapes in amanda. That one limitation of tape spanning in amanda was solved in a novel way: backup streams are broken up into chunks of a set size (the "chunksize"), so that a write failure at the end of one tape lets the backup begin fresh at the start of that chunk on the following tape. This feature also allows amanda to be used to back up to Amazon's S3 service. Yes, instead of going to tape, you can configure a tape server to write to S3. S3 limits writes to a maximum of 2GB a file, and amanda's virtual tape solution combined with that chunk sizing of backups works wonderfully to mate ZFS-based storage solutions to S3 for an n-tier solution. Please consult Zmanda's howto for configuring your server correctly. There really is nothing left to configure to get ZFS data to S3.
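To make the chunking idea concrete, here is a minimal sketch in Python of how a dump split into fixed-size chunks lands on a series of tapes. It is a simplified model, not amanda's actual code: the function name and parameters are made up for illustration, sizes are in arbitrary units (say GB), and it assumes the tape capacity is known up front, whereas a real run only discovers the end of a tape when a write fails.

```python
def span_chunks(dump_size, chunk_size, tape_capacity):
    """Assign fixed-size chunks of one dump to tapes (hypothetical model).

    A chunk that does not fit in the space left on the current tape is
    started over from its beginning on the next tape, so the tail of the
    previous tape is simply left unused.
    Returns a list of (tape_number, chunk_index, chunk_size) tuples.
    """
    if chunk_size > tape_capacity:
        raise ValueError("a chunk must fit on a single tape")
    layout = []
    tape = 1
    free = tape_capacity
    remaining = dump_size
    index = 0
    while remaining > 0:
        size = min(chunk_size, remaining)
        if size > free:
            # Not enough room left: restart this chunk on a fresh tape.
            tape += 1
            free = tape_capacity
        layout.append((tape, index, size))
        free -= size
        remaining -= size
        index += 1
    return layout


# A 10-unit dump in 2-unit chunks on 5-unit tapes.
layout = span_chunks(dump_size=10, chunk_size=2, tape_capacity=5)
for tape, index, size in layout:
    print(f"tape {tape}: chunk {index} ({size} units)")
```

The same picture applies to S3: each chunk becomes one object, so keeping the chunk size under the per-file limit is what makes the mapping work.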

Sunday, July 27, 2008

Pogo Linux, Nexenta announce StorageDirector Z-Series storage

"Pogo Linux Inc., a Seattle-based storage server manufacturer, and Nexenta Systems Inc., developer of NexentaStor, an open storage solution based upon the revolutionary file system ZFS, announced Wednesday immediate availability of a new set of storage appliances featuring NexentaStor."

Yeah.. that was the posted text above. What does it really mean? More kit choices to get an open storage NAS. Some nice configuration options when ordering, but I didn't see an easy way to request smaller system disks versus the rest of the data drives for any given Z series unit. It's a very good first step. If a Linux vendor adopts an appliance based on OpenSolaris (albeit a Debian/Ubuntu-lookalike), you know there is something cooking.


Monday, June 16, 2008

Closet WO Developer

One of many hats I wear is that of an erstwhile Java developer. Our internal apps have been heavily reliant on object relational mappers. I've dabbled in RoR, and even helped get a TurboGears project off the ground here that was fully open source. However, the primary solution we've used in production since 2002 has been the granddaddy of ORM solutions: WebObjects.

This past week at Apple's WWDC was a great one for WebObjects. The usual NDA applies. However, prior to that the WebObjects community had their own two day in-depth conference in San Francisco. No NDA for that, and I can report that WO development is alive, well, and dare I say thriving? The news about SproutCore has a second story, in that the backend of choice may be RoR, but the #1 apps will likely also be WO-based. Got an iPhone? Learn WO. As we opensource some of our projects here, I'll write a few more posts and speak on some more points, but with the latest release of WebObjects (5.4.x) the final deployment restrictions on the free WO frameworks were lifted. I expect some level of renewed interest.

To find out more, check in with the WOCommunity.

Wednesday, June 04, 2008

Recommended Disk Controllers for ZFS

Since I've been using OpenSolaris and ZFS (via NexentaStor, plug plug) extensively, I get a lot of emails asking about what hardware works best. There have been various postings on the opensolaris and zfs lists to the same effect. A lot of people reference the OpenSolaris HCL lists, which leave the average user scratching their head with more questions than answers. More to the point, the HCL doesn't tend to answer the more direct question of what hardware I should get to build a ZFS box, NAS, etc. It's important to note that in the case of ZFS, all that extra checksum, fault management, and performance goodness can be negated by selecting a "supported" hardware RAID card. Worse yet, many RAID cards are not fully interchangeable on the spot. What do you want for ZFS?

First, pick any 64-bit dual core or better motherboard and processor. If you can get ICH6+, nvidia, or Si3124-based on-board SATA, then you are in good shape for your basic ZFS box, with the on-board SATA used for your system disks alone. System disks can be low-end 5400RPM 2.5 inch SATA-I drives. Many people then desire some large-memory, battery-backed RAID card, but my tests with the high end LSI SAS cards show that memory on the RAID card doesn't do you as much good as a recipe of lots of system RAM, a sufficient number of cores, many disk drives for the spindles, and sufficient use of the PCI-X/PCIe bus using JBOD-only disk controllers. I'll cover the controllers next, but I'd recommend at this point 4GB of RAM minimum, dual core at greater than 2GHz, and for any good load, at least two PCI-X or multi-lane PCIe cards.

Disk controllers are where the real questions are asked. Over multiple iterations, heavy use, and some anecdotal evidence, we are down to some sweet spots. For PCI-X, there is one game in town, the Marvell-based AOC-SATA2-MV8, used in the X4500. At $100 for 8 JBOD SATA-II ports, it just works and is fault managed. Stick just SATA-II disks on these, and keep any SATA-I disks on the motherboard SATA ports for system disks. I'll add that various Si3124 based cards exist here, but not with sufficient port density.

SuperMicro AOC-SATA2-MV8 link

When it comes to PCIe, there aren't any good high port count options for SATA. If you need just 2 ports, or eSATA, there are various solutions based on the Si3124 chipset, and SIIG makes many of them for $50 each. However, in the PCIe world, the real answer is SAS HBAs that connect to internal or external mixed SAS/SATA disk chassis. Again, most SAS HBAs are either full fledged RAID without JBOD support, or simply don't work in the OpenSolaris ecosystem. 3ware is a lost cause here. The true winner for both cost and performance, while providing the JBOD you want, is the LSI SAS3442E-R.

CDW catalog link for LSI 3442ER
LSI 3442ER product page

It's $250, but I've seen it as low as $130. It has 8 channels, with 2 internal ports (generally 8 drives are connected to a single SAS port) as well as the external port. You can use this with an external SAS-backed array of SATA drives from Promise, for instance, to easily populate 16 or 32 drives internally, with an additional 48 drives externally, just from the one card. Would I suggest that many on that single card? No, but you can. Loading up your system with 2 or 4 of these cards, which are based on the LSI 1068 chipset that is well supported by Sun, is the best way forward for scale out performance. I was given some numbers of 200MB/sec writes and 400MB/sec reads on an example 12-drive system using RAIDZ. Good numbers, as I got 600MB/sec reads on a 48-drive X4500 thumper.
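One rough way to compare the read figures quoted above is to normalize them per drive. The sketch below does only that arithmetic, using the numbers from this post; it ignores pool layout, drive models, and workload, so treat it as a back-of-the-envelope comparison rather than a benchmark.

```python
def per_drive_mb_s(aggregate_mb_s, drives):
    """Aggregate throughput divided evenly across the drives in the pool."""
    return aggregate_mb_s / drives


raidz_12 = per_drive_mb_s(400, 12)   # 12-drive RAIDZ example, reads
thumper_48 = per_drive_mb_s(600, 48)  # 48-drive X4500, reads

print(f"12-drive system: {raidz_12:.1f} MB/sec per drive")
print(f"48-drive X4500:  {thumper_48:.1f} MB/sec per drive")
```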

If you have PCI-X, go Marvell. PCIe? Go LSI, but stick to the JBOD-capable not-so-RAID HBAs. Don't just trust me, throw a $100 or two at these and try it yourself. You'll see a better investment than $800 at the larger RAID cards. I went the latter route and have paid dearly (Adaptec, LSI, you name it). What worked from the beginning and is working today are the Marvell cards here, and I've been playing with new systems that use the LSI 3442ER.

Saturday, May 31, 2008

Mixing SATA dos and don'ts

Another day, another bug seemingly hit. I've known for some time that mixing SATA-I and SATA-II devices on the same controller with regards to OpenSolaris seems to be unwise. I've already had systems with the initial ZFS-boot drive being a small capacity and thus likely SATA-I, but the data volumes were SATA-II. My recent issues with the iRAM could be related to having a SATA-I device after a SATA-II drive in the chain, but nothing has been concrete.

However, today I discovered something else. One array I have is made up of all SATA-I drives and was used by a SATA-I RAID card that went south. I happily replaced it with the Marvell SATA-II JBOD card, and it was working just fine. I then lost the 6th of 7 drives, and went back to the manufacturer to try and buy a replacement. Sadly, these Raid Edition drives have been "updated" to be at a minimum of SATA-II for the same model. Replacing the failed SATA-I with the SATA-II worked, but on subsequent reboots, the 7th drive tended to not be enumerated by the Marvell card at startup, and even after re-inserting it, a "cfgadm" was necessary to activate it. Even then, a "zpool import" or "format" to introspect the now configured drive would wedge and never complete the command. Weird, right?

The solution to return to stability was to swap the 6th and 7th drive, so that the SATA-II disk came after all the SATA-I devices in the chain. I'm not sure why it works, but every reboot works now: it never fails to enumerate that last drive, and there is no need to manually cfgadm configure the drive post boot. Therefore, a set of truisms is starting to come together with mixed SATA drives. Whether Marvell, Sil3124, or the like, it's never a good idea to mix SATA-I and SATA-II devices on a single controller, but if necessary, make sure that the SATA-II drives come after the SATA-I drives. The best configuration is to restrict SATA-I boot devices, such as small 5400RPM "laptop" drives, to their own onboard SATA interface, and leave all SATA-II devices to add-on boards.
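The ordering rule of thumb can be written down as a small check. This is only a sketch of the rule as stated in this post: the function names and the (name, generation) tuples are hypothetical, and nothing here talks to a real controller.

```python
def sata_order_ok(generations):
    """Return True if no SATA-I drive (1) follows a SATA-II drive (2).

    `generations` lists the SATA generation of each drive in port order
    on a single controller.
    """
    seen_sata2 = False
    for gen in generations:
        if gen == 2:
            seen_sata2 = True
        elif gen == 1 and seen_sata2:
            return False
    return True


def suggested_order(drives):
    """Reorder (name, generation) pairs so SATA-I drives come first.

    sorted() is stable, so drives of the same generation keep their
    relative positions.
    """
    return sorted(drives, key=lambda drive: drive[1])


# The array from this post: the replaced 6th drive is SATA-II, the 7th is SATA-I.
before = [1, 1, 1, 1, 1, 2, 1]
after = [1, 1, 1, 1, 1, 1, 2]
print(sata_order_ok(before), sata_order_ok(after))
```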

Tuesday, May 27, 2008

The problem with slogs (How I lost everything!)...

A while back, I spoke of the virtues of using a slog device with ZFS. The system I went into production with had an Nvidia-based SATA controller onboard and a Gigabyte i-RAM card. No problems there, but at the time it was a cmdk driver (PATA mode) for my OpenSolaris-based NexentaStor NAS. After a while, I got an error where the i-RAM "reset" and the log went degraded. The system simply started to use the data disks for the intent log. So, no harm done. It's important to note that the kernel was a B70 OpenSolaris build.

Later, I wanted to upgrade to NexentaStor 1.0, which had B85. Post upgrade or using a boot CD, it would never come up with the i-RAM attached. The newer kernel used an nv_sata driver, and I could always get it to work in B70, so I reverted to that. This is one nice feature that Nexenta has had for quite some time, in that the whole OS is checkpointed using ZFS to allow reversions if an upgrade doesn't take. Well, the NAS doesn't like having a degraded volume, so I've been trying to "fix" the log device. Currently, in ZFS, log devices cannot be removed, only replaced. So, I tried to replace it using the "zpool" command. Replacing the failed log with itself always fails as it's "currently in use by another zpool". I figured out a way around that, and that was to fully clear the log using something like "dd if=/dev/zero of=/dev/mylogdevice bs=64k". I was able to upgrade my system to B85, and then I attempted to replace the drive again, and it looked like it was working:

  pool: data
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 0.00% done, 450151253h54m to go

        data          DEGRADED     0     0     0
          raidz1      ONLINE       0     0     0
            c7t0d0    ONLINE       0     0     0
            c7t1d0    ONLINE       0     0     0
            c6t1d0    ONLINE       0     0     0
        logs          DEGRADED     0     0     0
          replacing   DEGRADED     0     0     0
            c5d0      UNAVAIL      0     0     0  cannot open
            c8t1d0    ONLINE       0     0     0

errors: No known data errors

Note well that it is replacing one log device with another (using the new nv_sata naming). However, after it reached 1% it would always restart the resilver with no ZFS activity, no snapshots, etc. The system was busy resilvering and resetting, getting nowhere. I decided to reboot to B70, and as soon as that came up, it started to resilver immediately and, after quite a long time for a 2GB drive, completed the resilver. So, everything was now fine, right?

This is where things really went wrong. At the end of the resilver, it still considered the volume degraded, and looked like the above output but with only one log device. After rebooting the system, the volume started spewing out ZFS errors, and checksum counters went flying. My pool went offline. Another reboot, this time with the log device disconnected because nv_sata doesn't want it connected at boot, caused immediate kernel panics. What the hell was going on? Using the boot CD, I tried to import the volume. It told me that the volume had insufficient devices. A log device shouldn't be necessary for operation, as the pool hadn't needed it before. I attached the log device and ran cfgadm to configure it, which works and gets around the boot time nv_sata/i-RAM issue. Now it told me that I had sufficient devices, but what happened next was worse. The output showed that my volume consisted of one RAIDZ, an empty log device definition, and additionally my i-RAM as a degraded drive added to the array as a stripe! No ZFS command was run here. It was simply the state of the system relative to what the previous resilver had accomplished.

Any attempt to import the volume fails with a ZFS error regarding its inability to "iterate all the filesystems" or something to that effect. I was able to mount various ZFS volumes read-only by using "zfs mount -o ro data/proj" or similar. I then brought up my network and had to manually transfer the files off to recover, but this pool is now dead to the world.

What lessons have I learned? Slog devices in ZFS, though a great feature, should not be used in production until they can be evacuated. There may be errors in the actions I took above, but bugs that I see include the inability for the nv_sata driver to deal with the i-RAM device for some odd reason, at least in B82 and B85 (as I've so far tested). The other bug is that a log replace appears to either not resilver at all (B85) or, when resilvering in older releases, causes the system to not correctly resilver the log but instead to shim the slog in as a data stripe. I simply can't see how that is by any stretch of the imagination by design.

Saturday, May 03, 2008

ZFS: Is the ZIL always safe?

One of my ZFS-based appliances, used for long term backup, was upgraded from B70 to B85 of OpenSolaris two weeks ago. This time around, I re-installed the system to get RAIDZ2, and certain "hacks" that I've been using were no longer in place. The old settings were in /etc/system, and are the well known zil_disable and zfs_nocacheflush enabling. They were left there from when the system temporarily acted as a primary server for a short time with its Adaptec (aac) SATA RAID card and its accompanying SATA-I drives. Since the unit was UPS attached, it was relatively safe for NFS client access, and later on there was no direct client access over NFS. No harm done, and stable for quite some time over multiple upgrades from B36 or so, over a year without an error.
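For anyone wanting to check whether a box still carries these settings, here is a small sketch that scans the text of an /etc/system file for the two tunables. It assumes the commonly documented "set zfs:name = value" form and treats lines starting with "*" or "#" as comments; the function name and the sample text are made up for illustration.

```python
def unsafe_zfs_tunables(system_text):
    """Return {tunable: value} for the two ZFS tunables discussed here."""
    watched = {"zfs:zil_disable", "zfs:zfs_nocacheflush"}
    found = {}
    for raw in system_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("*") or line.startswith("#"):
            continue
        if not line.startswith("set "):
            continue
        name, sep, value = line[4:].partition("=")
        name = name.strip()
        if sep and name in watched:
            # Base 0 accepts both decimal ("1") and hex ("0x1") values.
            found[name] = int(value.strip(), 0)
    return found


# Hypothetical /etc/system contents for illustration.
sample_system = """\
* ZFS tunables left over from the old install
set zfs:zil_disable = 1
set zfs:zfs_nocacheflush = 1
"""
print(unsafe_zfs_tunables(sample_system))
```

On a live system the text would come from reading /etc/system itself rather than from a sample string.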

A curious thing happened as soon as I upgraded without these somewhat unsafe settings for the kernel. I started to get tons of errors, and twice my pool has gone completely offline until I cleared and scrubbed it. An example of the errors:

        tier2       DEGRADED     0     0     0
          raidz2    DEGRADED     0     0     0
            c1t1d0  FAULTED      0    64     0  too many errors
            c1t2d0  DEGRADED     0    46     0  too many errors
            c1t3d0  DEGRADED     0    32     0  too many errors
            c1t4d0  DEGRADED     0    47     0  too many errors
            c1t5d0  DEGRADED     0    39     0  too many errors
            c1t6d0  FAULTED      0   118     0  too many errors
            c1t7d0  DEGRADED     0    57     0  too many errors
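When these counters start moving nightly, it helps to tally them rather than eyeball them. The sketch below parses device lines shaped like the excerpt above (name, state, read, write, checksum counts, optional note). It is an illustration only: the function name is made up, and it assumes plain integer counters, so abbreviated counts that some releases print (such as "1.2K") would be skipped.

```python
def parse_devices(status_text):
    """Parse 'name state read write cksum [note]' lines into dicts."""
    devices = []
    for line in status_text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue
        name, state, read, write, cksum = parts[:5]
        # Only accept lines whose three counters are plain integers.
        if not (read.isdigit() and write.isdigit() and cksum.isdigit()):
            continue
        devices.append({
            "name": name,
            "state": state,
            "read": int(read),
            "write": int(write),
            "cksum": int(cksum),
            "note": " ".join(parts[5:]),
        })
    return devices


sample_status = """\
tier2 DEGRADED 0 0 0
raidz2 DEGRADED 0 0 0
c1t1d0 FAULTED 0 64 0 too many errors
c1t2d0 DEGRADED 0 46 0 too many errors
c1t3d0 DEGRADED 0 32 0 too many errors
c1t4d0 DEGRADED 0 47 0 too many errors
c1t5d0 DEGRADED 0 39 0 too many errors
c1t6d0 FAULTED 0 118 0 too many errors
c1t7d0 DEGRADED 0 57 0 too many errors
"""

devices = parse_devices(sample_status)
faulted = [d["name"] for d in devices if d["state"] == "FAULTED"]
total_write_errors = sum(d["write"] for d in devices)
print("faulted:", faulted)
print("write errors:", total_write_errors)
```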

Nothing explained the turnaround from stable to useless for any writes. I also got some read errors, and no nightly rsync against this tree would survive without incrementing some error count. Was it somehow one of my cache settings on the adaptec card that conflicted with a new version of the "aac" driver? I thought I would need to isolate it, revert perhaps, or consider that somehow my card was simply dying. Perhaps the cache/RAM on the card itself was toast.

A recent post on the opensolaris-discuss mailing list gave me an idea. Mike DeMarco suggested to a user suffering from repeated crashes that corrupt ZFS until cleared to try and use zil_disable to test "if zfs write cache of many small files on large FS is causing the problems." Makes some sense if the card is somehow trashing on small writes. The use of it for backup means that it's being read and written to via rsync, which can involve many small updates. I also had various read errors pop up. So, after another degraded pool, I put the old faithful zil_disable and, for good measure, zfs_nocacheflush back, and after a reboot and scrub, let it do its nightly multi-terabyte delta rsyncs. After a few days, there are no errors. Have I stumbled onto some code path bug that is ameliorated by these kernel options? Do newer kernels have suspect aac drivers?

Perhaps someone will prove the logic of the above all wrong, but for now, I'm returning to the old standby "unsafe" kernel options to keep my pool stable.

Thursday, April 03, 2008

OpenDirectory upgrade path from 10.4 to 10.5

In EE we've migrated over from various AD and OpenLDAP installations to what we hope is a more manageable solution long term. Sadly, upgrading OpenDirectory (MacOSX OpenLDAP-based directory services) from 10.4 to 10.5 doesn't work as Apple states it would. Here's the complete recipe we used to keep our data, our passwords, and most importantly, our domain SID. Apple tends to not care about maintaining the SID in various replica-to-master promotion steps.

First, a reference to the cookbook for doing things the hard way.

As recommended in the above and from other postings, upgrades do not work. Rather, what needs to be done is this:

10.4 Server:

1) go to Server Admin, OpenDirectory, and under the Archive tab, generate an archive of the OpenDirectory DB. Place in admin home directory
2) For safe keeping, go to /var/db/samba and get the secrets.tdb file. Place in admin home directory (readable by all)
3) get the current SID by running as root/sudo "net getdomainsid EE" where EE is the domain we are supporting. Place in home directory
4) copy off to a 3rd party machine the above three files/directories

10.5 Server:

1) Install fresh, and use the exact same IP and name as the 10.4 Server. You'll likely need to have these on their own net. Also note that without a link on the primary interface, smb, dns, and opendirectory don't work. I suggest connecting to the third party machine listed above, in my case my laptop's physical connection which I assign to the private net
2) You'll need DNS setup temporarily, so create a DNS server for your domain ( and create a host entry for your self. Point local network settings to self as DNS server
3) copy over the files saved from 10.4 from the laptop/3rd party machine
4) Make an OpenDirectory Master, using the correct domain "dc=ee,dc=stanford,dc=edu" and correct KRB realm "EE.STANFORD.EDU"
5) import the archive of 10.4
6) run as root "mkpassdb -kerberize"
7) Create a new PDC config for Windows. Use the directoryadmin account/password to give samba correct access to the OpenDirectory DB
8) edit /var/db/smb.conf to fit the /etc/smb.conf entries you had on 10.4. Likely you'll want to make "local path = " and add "admin users = directoryadmin, domainjoin, @admin" or the like, where the first is the directory admin acct, the second is a PDC join account that can't login, but has directory admin rights. @admin works to include anyone in admin group
9) run as root "chflags uchg /var/db/smb.conf" to freeze your samba config. Recommend making a copy as well in the same dir.
10) run as root "net setdomainsid (SID)" where SID is the one you saved from 10.4
11) Go into Workgroup Manager. Change preferences to enable Inspector. Go into Inspector and select "Config" and then "CIFSServer". The two Value lines with "xml version.." need to have Edit run against them, and replace the SID line in each with the SID you just used.
12) restart Samba/Windows services. Check SID with, as root, "net getdomainsid" and "net getlocalsid EE" or the like. If anything didn't stick, do 10, 11 again.
13) before going live, one needs to remove reference to the local DNS in Network preferences, and optionally disable DNS service. This setup also was only tested with Wins service enabled as the WINS Server
14) test, test, test from Windows including domain logins, enumeration of groups in windows for adding domain users, etc. Logs may show if accounts are failing.
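Since the whole point of steps 10 through 12 is that the SID survives the move, it can help to compare the value saved from 10.4 against what the new server reports. Here is a minimal sketch that pulls a SID out of captured command output and compares the two. The exact wording of the "net getdomainsid" output and the SID values below are assumptions for illustration; only the S-1-... pattern matters to the code.

```python
import re

# Matches SID strings such as S-1-5-21-<n>-<n>-<n>.
SID_RE = re.compile(r"S-1-\d+(?:-\d+)+")


def extract_sid(text):
    """Return the first SID found in the text, or None."""
    match = SID_RE.search(text)
    return match.group(0) if match else None


def sid_preserved(saved_output, current_output):
    """True if both outputs contain a SID and the two SIDs are identical."""
    saved = extract_sid(saved_output)
    return saved is not None and saved == extract_sid(current_output)


# Hypothetical captured outputs; the SIDs are made up.
saved_104 = "SID for domain EE is: S-1-5-21-1111-2222-3333"
fresh_105 = "SID for domain EE is: S-1-5-21-4444-5555-6666"
fixed_105 = "SID for domain EE is: S-1-5-21-1111-2222-3333"

print(sid_preserved(saved_104, fresh_105))
print(sid_preserved(saved_104, fixed_105))
```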

On Windows, the simple tests you can do involve the utility "nltest" which is in the free SUPPORT TOOLS (but may not be installed by default). nltest /? gives commands although OS-X samba only supports some of them.

list PDC and BDCs --- nltest /dclist:your_domain

C:\>nltest /dclist:ee
Domain 'ee' is pre Windows 2000 domain. (Using NetServerEnum).
List of DCs in Domain ee
The command completed successfully

verify schannel --- nltest /sc_query:your_domain

C:\>nltest /sc_query:ee
Flags: 0
Trusted DC Name \\EE-OD
Trusted DC Connection Status Status = 0 0x0 NERR_Success
The command completed successfully

To do a more detailed check, you can open the Windows Manager and try to look at the members of the Administrator group for the machine. When we had trouble, it just showed raw SID numbers, even for EE\DomAdmins. Once it was fixed, then that showed correctly.

Error cheat sheet:

1. If smb logs show that directoryadmin or domainjoin and the like have the "wrong sid" in passdb, you'll need to demote/promote Windows Services to workgroup and back to PDC. You'll need to run "chflags nouchg /var/db/smb.conf" first, and copy back your saved version after repromotion as the file will be rewritten. Do steps 9-12 above again

2. If kerberos isn't effectively working on clients, you may need to reimport the archive OpenDirectory, rerun "mkpassdb -kerberize" and follow the above demote/promote steps.

Have NAS, Want Shell

Now that anyone can grab Nexenta's NAS product, many will undoubtedly want to get under the hood, especially developers. First, a fair warning that although the management infrastructure is resilient to many changes done manually, modifying various service configurations outside of Nexenta's internal version control can lead to one or two headaches if you aren't careful. That said, give me a shell!

Well, that's simple. When you log in via the console (ssh, for example), simply run "setup appliance nmc edit-settings". You can tab your way through that command as well. Once there, go and edit expert_mode to be "1". Yes, you've entered the "vi" command zone, so save and exit with ':wq'

Once the changes are saved, you'll be asked to refresh the console settings, and now you can type "!bash" to get a nice usable shell, or bang escape any command. You'll be root, so be aware and behave! Now you know what Nexenta Core was all about, as it's all there at your fingertips, along with the NMS, NMC, and NMV subsystems that are the foundation of the NAS product.

I was told that an alternative way to set expert mode is
option expert_mode = 1 -s
as denoted in the "option -h" documentation for NMC. The "-s" flag updates the on-disk configuration.

Developers, developers, developers...

Ever wanted that NAS on your own hardware, for free? Nexenta has finally released their NexentaStor Developer Edition 1.0, which is a free version of their commercial product with only a 1TB limit on used storage. All functionality otherwise is there, unlimited. This is a near final release for the commercial version, but is the first version the general public can get and install on their own hardware.

The release represents many things, but the Developer releases are focused on more than just tire kicking or a free NAS product for your home NAS needs. Rather, there is a lot of potential to extend and use Nexenta's SA-API for storage service-enabled solutions. Wish to modify your DB to wrap a transaction in a snapshot? Need to automate separate file system creation, quotas, etc for your users? Registered users on the web site can look at an overview of the architecture and sample SA-API components. I expect much more in the way of API details in the near future. However, the release of the product is here today.

A general support forum is also available

There are two other automation aspects to NexentaStor that I haven't given much love to here. Both utilize the batch nature of NMC, the Nexenta Management Console. One is the 'query' functionality, which allows various introspections on the NAS and can query across multiple appliances at once if they are grouped together (the group function). In a similar vein, there is the NMC recording facility, which is handled by the "record" command. Recording allows you to save and play back actions for various tasks, including over a network of NAS devices. All of these commands have ready examples available by invoking the command with the "-h" help argument in the console. There is also good stuff in the User Guide which is available for download.

Friday, March 21, 2008

Step by Step CIFS Server setup with OpenSolaris

After CIFS Server was released into the OpenSolaris wild, I could not for the life of me get it to work. Even in the post B82 stage, the random collection of documentation led me astray in multiple ways. I think part of the problem is that I read up on it too much, and thus old requirements were no longer accurate and got in the way. You need to set up your krb5.conf file, right? LDAP too? The final resolution appears to be rather straightforward, and it likely shows that other steps I had taken previously were left rotting on my system and prevented a working solution.

So, what do you actually need? I'd recommend starting with at least B85. In my case I used the latest NexentaOS unstable release (1.0.1 to be) which includes B85 and by default the necessary Sun smb packages. For my test, I created a contrived domain using Windows 2003 Server (SP2) called WIN.NEXENTA.ORG. The rest follows:

add to /etc/resolv.conf:
(Nameserver is our AD DNS server)

(optional: run ntpdate against your time server)
#svcadm enable svc:/network/ntp:default
#svcadm enable -r smb/server
#smbadm join -u Administrator

#zfs set sharesmb=on data/myshare
#zfs set sharesmb=name=myshare data/myshare

#mkdir /data/myshare/jlittle
#chown jlittle /data/myshare/jlittle

#idmap add 'winuser:*' 'unixuser:*'
#idmap add 'wingroup:Domain Users' 'unixgroup:staff'

#svcadm restart smb/server
#svcadm restart idmap

Other advisable steps include casesensitivity=mixed for correctness for Windows users (note that this property can only be set when the filesystem is created, as in "zfs create -o casesensitivity=mixed data/share"), though it is likely not ideal if the zfs filesystem shared is also shared to NFS clients. You know it's all working if "idmap dump" gives you real values and doesn't just return to the prompt. I connected to my new share via a MacOSX client, and made sure my domain matched as "" when connecting to my share (aka smb://server/myshare/jlittle).

In the end, it was much simpler than the documents suggested. I had to avoid explicitly stating the domains in idmap to make idmap do the right thing. You should pick the right local group for your users in the mapping for groups. I picked "staff" as that was the default group of my user.
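To picture what the two name-based rules in the setup do, here is a toy model in Python. It is not how the idmap service is implemented; it only mimics the wildcard and exact-name rules used in this post. The function name is hypothetical, and the case-insensitive match on Windows names is an assumption.

```python
def map_name(rules, kind, win_name):
    """Map a Windows user or group name to a Unix name using simple rules.

    `rules` is a list of (windows_side, unix_side) strings such as
    ('winuser:*', 'unixuser:*'). `kind` is 'user' or 'group'.
    An exact-name rule wins over a wildcard rule.
    """
    exact = None
    wildcard = None
    for win_side, unix_side in rules:
        win_kind, _, win_pattern = win_side.partition(":")
        unix_kind, _, unix_pattern = unix_side.partition(":")
        if win_kind != "win" + kind or unix_kind != "unix" + kind:
            continue
        if win_pattern == "*":
            # '*' on the Unix side means "same name".
            wildcard = win_name if unix_pattern == "*" else unix_pattern
        elif win_pattern.lower() == win_name.lower():
            exact = unix_pattern
    return exact if exact is not None else wildcard


rules = [
    ("winuser:*", "unixuser:*"),
    ("wingroup:Domain Users", "unixgroup:staff"),
]
print(map_name(rules, "user", "jlittle"))
print(map_name(rules, "group", "Domain Users"))
```

Note that with only these two rules, any Windows group other than "Domain Users" has no name-based match in this model, which is why picking the right local group for the group rule matters.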

Monday, March 03, 2008

Random Storage Comments, Answered

In my last posting, a lot of comments covered wide and varied ground. First, it's important to note that even with CDP underlying ZFS pools, ZFS itself provides for its own integrity of state. If CDP didn't complete a transaction, a re-sync will generally resolve it, but the actual hosted ZFS filesystem need not fear, as its transactions won't be finished until the write is checksummed. I agree that there are failure modes here, but that leads to a good quote in one of the comments:

"To that end, it seems to be that whenever a choice can be had between doing something simple to accomplish a goal and chaining a bunch of parts together to accomplish the same goal with more sophistication, its likely the simpler solution will be more sustainable over time."

I concur. Nexenta marries two pieces of functionality to get auto-cdp, and they rely on the two components in whole to maintain overall simplicity of implementation. The real value that they have provided is in making the front end dead simple. If the management isn't simplified, any level of underlying functionality will be lost in the long run.

I want to focus more on the simplicity of the performant NAS solutions. With pNFS, Lustre, and the like, we know that the client becomes a bit less transparent, and the backend store of data definitely becomes somewhat opaque, as data is no longer consistent per one server but is spread out across the whole back end. Even though you need newer clients with specific functionality in both cases, it can again be simpler than the alternative, which generally involves an NFS v3 client using automounts, LDAP-based auto mount maps, and heavy handed data management on the backend to scale out in similar ways over multiple NAS heads. The tack of taking a single high end head with best of breed backend hardware, such as IB interconnects to SAS disk arrays and 10Gb Ethernet out the front, might seem to work, but we have already seen pathological conditions where a single heavy client writing millions of small files can make that enhanced hardware meaningless for performance.

There is no fast answer to solving both scale out with regards to capacity and performance without a little give on each aspect of the design. What makes it all reasonable to consider is if the entire solution is made greatly more simple to manage than the alternatives at either end of the design spectrum. In the end, simplicity of manageability will trump other considerations. As long as simplicity is strictly maintained in the product, the underlying complexity will seem well worth the effort. We just need to trust that someone gets the fine details. In the end, we don't mind that we can't muck much with a highly efficient but high performance car. As long as we feel mastery over its operation and trust in the quality of the build and service by the manufacturer, we all are willing to make the investment.

Thursday, February 07, 2008

The ZFS scaling and DR question

In my dealings with ZFS-based NAS and second-tier solutions, I've been blessed to hear from different people with thoughts that push the discussion forward. The ZFS space is where a lot of long-term strategies that utilize commodity components are covered. Other spaces that I follow seem somewhat stale, or consider specialized point systems that solve problems between just two or three chosen vendors. ZFS is open, so I think its ecosystem can only grow. I'm happy to have permission from Wayne Wilson to restate one of his emails and use it as a discussion point.

I see two different paths to scale:

1) Use a single system to run NexentaStor and have it mount
the remaining systems via iSCSI.  This would probably have to
be done over 10 Gbs or infiniband links.

2) Create Multiple NexentaStor systems and only present them
as a unified file system via NFS V4.  This would leave the
CIFS clients restricted to mounting multiple volumes, but that
may be ok.

Is this it or is there some other way?

Next architectural issue is how to do DR and archiving.

There are two types of DR - Human initiated and machine initiated.

My standard strategy for the Human initiated DR is to use
snapshots and keep enough of them around to answer most restore
requests. For machine initiated, my worst case is when the
storage subsystem (either a complete Thumper or a complete array)
fails.  For this I can find no other solution than to replicate.

  As you have pointed out, replication usually locks you
into the backend storage vendors system, whereas it would be
better for us consumers to be able to 'mirror' across two
disparate storage backends.

Here is where things might fall apart in using Thumpers. We
could probably spec out a really high I/O throughput 'head end'
type server to load Nexenta on.  Then we could present any kind
of storage as iSCSI LUN's to the system. Let's say we use a
Thumper 48TB system to as an iSCSI target for our head end, then
we could use an Infortrends (or some such) SATA/iSCSI array for
other LUN's and let ZFS mirror across them.......and rely on
snapshots for human based DR and mirror failure protection for
machine based DR.

Then that leaves us with Archiving. I think that here is where
a time based tier, or at least the ability to define a tier based
on age would be useful.  If we set the age to a point beyond which
most changes are taking place (letting snapshot's take the brunt
of the change load before archiving), then it is likely that we
will have just one copy of the data to archive off, for most files.

What we would want to do is the make tape the archival tier. I am
uncertain as to how to do this.  Should it be done using vendor
backup software to allow catalogs and GUI retrievals?

Wayne covers how to scale out NAS heads based on ZFS, as well as the standard DR and archive question.

Considering the NAS head scenario, it is plain that in the long term, running storage across multiple NAS heads using upcoming technologies such as Lustre or pNFS, as they mature, will be necessary to scale out in both performance and capacity. However, it is reasonable to consider solutions that use DAS/IB/external SAS/iSCSI to take one head node and approach petabyte levels. I consider this within reason if the target is second-tier or digital archive storage, where performance isn't king. With the best of hardware, perhaps the single head node (or HA config) will have sufficient performance for most primary storage deployments. Time will tell, as our needs come to require such solutions and the technologies we employ improve.

I brought up disaster recovery in a recent post. File-based backup solutions work well for the act of backing up, but restoration at large scale is wanting, especially when the number of files matters more than their size. The next beta of NexentaStor happily takes a large leap forward in addressing this by implementing a very simple-to-manage auto-cdp (continuous data protection) service across multiple NAS heads. This keeps multiple storage solutions in sync as data is committed, operates below the ZFS layer, and is bidirectional. Yes, the secondary system can import the same LUNs or ZFS pools and re-export them to your clients. Just as important, if you lose the primary host, synchronization can be reversed to restore your primary system at full block-level speeds.

If you take this approach, also consider exposing native zvols or NFS/CIFS to your clients (such as a mail server). They can keep using their local DAS storage under any OS and filesystem, while using native backup tools to write regular block-level dumps to the ZFS-exported volumes, which allows speedy block-level restores. Mixing this with file-level backups even permits less frequent full dumps and greater granularity in recovery. In the end, you'd hope to have these wonderful features on your server OS directly to prevent having to do DR, but you can see that we are approaching reasonable answers.

The final issue brought up is archival, and I hope my previous posts have gone far in answering it. In general, I believe disk-based archival solutions need to be employed before tape is considered, and tape should be fully relegated to the final archival stages only. Today, you can use multiple open (Amanda/Bacula) and closed backup software solutions to write to tape libraries from trailing-edge snapshots. I also know that, though still in their infancy, the NDMP client services evolving for ZFS will someday allow easier integration into current backup systems, letting most people convert existing tape-based solutions completely into their last-tier archive, running infrequently for long periods with just full backups.
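Wayne's idea of a tier defined by age can be sketched in a few lines. This is only an illustration of the selection step, not a feature of NexentaStor, Amanda, or Bacula: the function name `archive_candidates` and the `min_age_days` parameter are made up for this example, and it assumes file modification time is a good enough proxy for "changes have settled".

```python
import os
import time
from pathlib import Path


def archive_candidates(root, min_age_days, now=None):
    """Yield files under root whose last modification is older than min_age_days.

    Hypothetical helper: picks what an age-based archive tier would send to tape.
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400  # seconds per day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                if path.stat().st_mtime < cutoff:
                    yield path
            except OSError:
                # File vanished or is unreadable; skip it rather than abort.
                continue


# Example use (the path is a placeholder, not a real dataset):
#   for path in archive_candidates("/volumes/tier2/.zfs/snapshot/some-snapshot", 365):
#       print(path)
```

Pointing this at a snapshot directory rather than the live filesystem keeps the file list stable while a multi-day tape run is in progress. The resulting list would still need to be handed to whatever backup software keeps the catalog.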

All the above is just my "it's getting better" perspective. Perhaps you can find some glaring weakness. I hope you can all shortly see in action the auto-cdp service that Nexenta has put together. It's well worth the wait.

Thursday, January 17, 2008

Using the iRam: Improving ZFS perceived transaction latency

I've been long overdue in reviewing the Gigabyte iRam card and its effect on the performance of your favorite ZFS NAS product. NexentaStor already supports log devices, so the time appeared right to get one of these for a client I consult with, to help deal with the noticeable pauses one can see when heavy reads and writes compete on a ZFS pool. I hope that the single-threaded nature of those commits is resolved at some future point, but the iRam card appears to be a simple way to inject an NVRAM-like device into your commodity NAS solution.

The card itself is simply four DIMM sockets for DDR RAM, with a battery backup, a reset switch, power drawn from the PCI bus, and a single SATA-I connection to plug the unit into your existing SATA interfaces. Already you can see that the performance limit is 150MB/sec based on the SATA-I spec. What does this card do, though? It gives near-instant reads and writes in a safe, battery-backed ramdisk that your system sees as a 2GB or 4GB drive, just what you'd want for a dedicated write commit device. With many spindles in an array, you can likely do better than this device for raw throughput, but for many small commits, the near-perfect latency of RAM is far better suited to keeping writes flowing without stalling the drives for reads. Since it's a "slog" device in ZFS terms, it will regularly commit to the real underlying storage at full disk bandwidth. Therefore, even when writes must compete with reads on the physical disk, you limit your exposure to perceived stalls in I/O requests, even in the higher load cases.
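For anyone wondering where the 150MB/sec ceiling comes from, here is the arithmetic as a small sketch. It assumes the usual reading of the SATA-I numbers: a 1.5 Gbit/s line rate with 8b/10b encoding, so only 8 of every 10 bits on the wire carry data. The function name is made up for illustration.

```python
SATA1_LINE_RATE_BITS = 1.5e9   # SATA-I signalling rate in bits per second
ENCODING_EFFICIENCY = 8 / 10   # 8b/10b encoding: 8 data bits per 10 line bits


def payload_mb_per_s(line_rate_bits, efficiency):
    """Usable payload rate in MB/s (decimal megabytes) for an encoded serial link."""
    return line_rate_bits * efficiency / 8 / 1e6


print(f"SATA-I payload ceiling: "
      f"{payload_mb_per_s(SATA1_LINE_RATE_BITS, ENCODING_EFFICIENCY):.0f} MB/s")
```

Protocol overhead eats into this further, so real transfers land somewhat below the ceiling.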

For my non-production test, I actually put together the worst-case scenario: an iSCSI-backed ZFS array with NFS clients and many small files. In this case, every NFS write requires 3 fsyncs on the back-end storage, as required by NFS (create, modify, close). This is actually similar to CAD libraries, which the test was made to reflect. Using iSCSI devices, you can inflict much higher latencies. My iSCSI targets are themselves older SATA-I drives on an SBEi Linux-based target using 3ware 8500s. Again, nowhere near ideal.

I created a directory of 5000 small 8K files and copied it from a gig-e connected Linux client to a ZFS pool (made of two non-striped iSCSI LUNs), getting a meager 200K/sec write performance over NFS. Striping the data across the pool instead, I increased the numbers to 600K/sec at some points. Adding a 2GB Gigabyte iRam drive, I increased those numbers to a peak of 9MB/sec, averaging around 5MB/sec overall. That's on the order of 10 times the performance. Again, this test involves many I/O operations rather than any real bandwidth.
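To see why this test is bound by commit latency rather than bandwidth, the throughput figures can be turned back into an implied time per synchronous commit. This is a rough model, not a measurement: it assumes the three commits per file dominate the elapsed time, and it takes 1 MB as 1024 KB.

```python
# Latency-bound model of small-file NFS writes (illustrative only).
FILE_SIZE_KB = 8        # file size used in the test above
SYNCS_PER_FILE = 3      # create, modify, close, as described above


def files_per_second(throughput_kb_per_s):
    """Files completed per second at an observed throughput."""
    return throughput_kb_per_s / FILE_SIZE_KB


def ms_per_sync(throughput_kb_per_s):
    """Implied milliseconds per synchronous commit, assuming commits dominate."""
    syncs_per_second = files_per_second(throughput_kb_per_s) * SYNCS_PER_FILE
    return 1000.0 / syncs_per_second


# Observed rates from the test, in KB/s (1 MB = 1024 KB is an assumption).
observed = {
    "two iSCSI LUNs, non-striped": 200,
    "striped pool (peak)": 600,
    "with iRam log (average)": 5 * 1024,
    "with iRam log (peak)": 9 * 1024,
}

for label, kb_s in observed.items():
    print(f"{label:30s} {files_per_second(kb_s):8.1f} files/s "
          f"{ms_per_sync(kb_s):7.2f} ms/sync")
```

Under these assumptions the non-striped pool works out to about 13 ms per commit, which looks like a disk and network round trip, while the iRam average comes to roughly half a millisecond.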

How fast can data be written to and read from that log device? My tests showed that 100MB/sec for reads and writes was common, with writes only bursting to those numbers for larger streaming data sets. In the case of the iSCSI nodes in question, each one could be pulled at a top rate of 45MB/sec, but averaged closer to 27MB/sec. Nominally, you can see that we are about 3x better than at least these gig-e iSCSI devices.

The final production installation of the iRam device was with a SATA-II DAS array, and even in heavier load scenarios, we saw the wait cycle for write commits to the drives limited, and a steady 100+MB/sec use of the commit log (reads and writes). The only caveat for using such a device is that the current builds of OpenSolaris and thus NexentaStor do not allow you to remove it once added to a pool. A future release is supposed to address that.

Friday, January 11, 2008

Swept Under the Rug

In our day to day management of technology, we tend to pick paths that resolve the most pressing pain points. Inadvertently, we often also sweep certain problems under the rug, awaiting the day when it all must be cleaned up. Many choices do exactly this, solving the present problem while creating perhaps larger problems down the road. In my evolving strategy on storage, the move away from tape to disk-based online storage solves the most obvious problems but creates a whole series of other problems, including file based disaster recovery, long term maintenance of the underlying disk technology, true long term persistence of data, and general accessibility of the data by future technology. Today, I'll focus only on our next major pain point, disaster recovery.

Recently, a few incidents underscored the need for better thought-out solutions than what we already had in place. We thought we might be ahead of the curve with tiered copies of data on secondary NAS solutions, with our backup windows well within reason. It's obvious we made the right choice in doing incremental file-based backups to secondary NAS, as the end data containers are universal across network file protocols. Recovery of any given file, or perhaps even full data store recovery, still beats that of tape libraries multiple times over. However, the architecture in place has allowed us to scale from the gigabyte world to the terabyte world. Our backup windows are well in hand, and spot recovery is a cinch. But there are some problematic disaster recovery scenarios.

The first scenario was felt just a week ago. A mere 50GB file store of Maildir-formatted mail, where each message is a file itself and mail folders are represented by directories, had write errors on its underlying Linux XFS volume. This is far from our largest such install. Various mail servers for separate organizations we deal with are over 500GB in size. We suspect the RAID card's NVRAM was toast, disallowing further writes, and we had to migrate the mail to another server quickly. Simple enough, let's recover from our second-tier mail store, right? The attempt was made, but we found ourselves limited not so much by reading millions of small 1K files as by recommitting those files onto a journaled filesystem. The metadata updates for the files alone were bad enough. In the end, we were limited by file operations per second, and not pure bandwidth to the disk. Our estimated time of recovery was a minimum of 14 hours, and that for only 50GB. A clue to the long-term solution lies in how we actually restored everything in less than 2 hours: we relied upon an xfsdump from the read-only failing array to a new filesystem on the spare hardware.

The obvious up-front answer to disaster recovery of data in a multi-terabyte world is to make sure you have copies of everything in as close to a high-availability setup as you can afford. If the underlying RAID array were actually two arrays with software mirroring across the two, or if it were two separate machines that either attached to shared mirrored arrays or otherwise mirrored their underlying RAID arrays over the network, we'd all just worry about natural disasters. Preventing the true disaster recovery scenario up front is the only true way to win, but most of us simply don't have the luxury, the resources, or the ability to safely migrate the myriad production or otherwise in-use solutions over to the ideal configuration. We can all try to reach this nirvana, but it's simply not as attainable for most of us as we'd like.

We can, however, address some of the pain of the disaster recovery scenario from disk based solutions. The iSCSI and SAN vendors have been on this for some time, and have extolled the virtues of block based storage. Using such, you can stream I/O at near the theoretical limits of the hardware. However, running all your systems against a SAN throws you down the path of the usual hardware based solution to a general problem, with the usual vendor lock-in quibbles. We already have bought into the software based approach that NexentaStor has offered us, and happily, they already provide a similar solution to our needs. With thin-provisioning of virtualized storage volumes (zvols), one can expose block level storage to clients but still treat them as snapshot capable files, use file level services and such on the back end second level NAS. The clients will generally access these through iSCSI, and they can either directly depend on these network-based volumes as if they were local filesystems, or simply use their filesystem native dump programs to periodically maintain a near synchronized copy of a true DAS filesystem to a second tier block level copy. The latter is nice as it doesn't place undue strain on the back end storage architecture to service all clients in parallel at the fullest of performance for production. We just use network and storage resources for backup.

What does this solve? In the case of disaster recovery, the backup process can be run in reverse, getting streaming I/O rates, perhaps as high as 100+MB/sec over gigabit Ethernet, when the local arrays fail. In the case of my mail spool filesystem, we recovered at rates of 25-30MB/sec instead of the 500-800K/sec we saw with the file-level restore. Even if the block-level copy is not the most up to date, if you also do file-level backups of the underlying filesystem to NexentaStor or the like at a more frequent interval, you can recover from those incrementally after the first block-level recovery. Either way, you taste the sweetness of success. Again, the dirt is stuffed under the rug if this is hundreds of terabytes, and some day soon that may be just as commonplace, but perhaps we are again ahead of the curve. The one side effect is that you'll want more cheap storage readily available on the second tier.
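The gap between those two rates is easier to appreciate as hours of downtime. The sketch below just divides the 50GB spool by each quoted rate. It assumes the rate is sustained for the whole transfer and takes 1 GB as 1024 MB, so treat the output as ballpark figures rather than a reconstruction of the actual recovery.

```python
def recovery_hours(size_gb, rate_mb_per_s):
    """Hours needed to move size_gb at a sustained rate (1 GB = 1024 MB assumed)."""
    seconds = size_gb * 1024 / rate_mb_per_s
    return seconds / 3600


SPOOL_GB = 50  # size of the mail spool discussed above

# File-level restore at the 500-800 KB/s observed.
for kb_s in (500, 800):
    hours = recovery_hours(SPOOL_GB, kb_s / 1024)
    print(f"file level  {kb_s:4d} KB/s: {hours:5.1f} h")

# Block-level restore at the 25-30 MB/s observed.
for mb_s in (25, 30):
    hours = recovery_hours(SPOOL_GB, mb_s)
    print(f"block level {mb_s:4d} MB/s: {hours:5.1f} h")
```

The same function can be fed a multi-terabyte size to see how quickly file-level restores stop being practical.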

I'll quickly describe the second scenario, where a failing system also needed its 1TB of mainly larger files migrated. We saw that our top rates of file level recovery at best were 1GB/minute, but generally less. Again, it would have made sense to have been redundant up front, but the same solution above could more than double the rate of recovery if we could restore the primary file system at block speeds. This is similar to how virtual machines are managed from SAN, iSCSI, or even NFS. The VMs themselves are represented as files, and so operations on these files approach the maximum speed of block storage operations. However, having them on that NAS allows ease of sharing and management, including snapshots. No hardware tricks, all software. We haven't addressed the next stumbling blocks, which include kernel page size limitations on true file I/O, but the dirt is nicely hidden for the time being.