Saturday, May 03, 2008

ZFS: Is the ZIL always safe?

One of my ZFS-based appliances, used for long term backup, was upgraded from B70 to B85 of OpenSolaris two weeks ago. This time around, I re-installed the system to get RAIDZ2, so certain "hacks" that I'd been using were no longer in place. The old settings lived in /etc/system: the well-known zil_disable and zfs_nocacheflush tunables, both enabled. They were left over from when the system briefly acted as a primary server with its Adaptec (aac) SATA RAID card and accompanying SATA-I drives. Since the unit was UPS attached, it was relatively safe for NFS client access, and later on there was no direct client access over NFS. No harm done: the system was stable over multiple upgrades from B36 or so, running for over a year without an error.

A curious thing happened as soon as I upgraded without these somewhat unsafe kernel settings. I started to get tons of errors, and twice my pool has gone completely offline until I cleared and scrubbed it. An example of the errors:

NAME        STATE     READ WRITE CKSUM
tier2       DEGRADED     0     0     0
  raidz2    DEGRADED     0     0     0
    c1t1d0  FAULTED      0    64     0  too many errors
    c1t2d0  DEGRADED     0    46     0  too many errors
    c1t3d0  DEGRADED     0    32     0  too many errors
    c1t4d0  DEGRADED     0    47     0  too many errors
    c1t5d0  DEGRADED     0    39     0  too many errors
    c1t6d0  FAULTED      0   118     0  too many errors
    c1t7d0  DEGRADED     0    57     0  too many errors
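
To keep an eye on counts like these between scrubs, a small script can tally them. The following is only a sketch: parse_errors is a hypothetical helper, the sample text is the listing above, and real "zpool status" output can abbreviate large counters (which this sketch simply skips).

```python
# Sketch: tally per-device error counters from `zpool status`-style text.
# parse_errors is a hypothetical helper, not part of any ZFS tooling.

SAMPLE = """\
NAME STATE READ WRITE CKSUM
tier2 DEGRADED 0 0 0
raidz2 DEGRADED 0 0 0
c1t1d0 FAULTED 0 64 0 too many errors
c1t2d0 DEGRADED 0 46 0 too many errors
c1t3d0 DEGRADED 0 32 0 too many errors
c1t4d0 DEGRADED 0 47 0 too many errors
c1t5d0 DEGRADED 0 39 0 too many errors
c1t6d0 FAULTED 0 118 0 too many errors
c1t7d0 DEGRADED 0 57 0 too many errors
"""


def parse_errors(text):
    """Return {name: (state, read, write, cksum)} for each data row."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and any line without NAME, STATE and three counters.
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        try:
            read, write, cksum = (int(f) for f in fields[2:5])
        except ValueError:
            # Non-integer counters (abbreviated values) are ignored here.
            continue
        rows[fields[0]] = (fields[1], read, write, cksum)
    return rows


rows = parse_errors(SAMPLE)
# Keep only the leaf disks; "tier2" and "raidz2" are the pool and vdev rows.
disks = {name: row for name, row in rows.items() if name.startswith("c1t")}
total_writes = sum(row[2] for row in disks.values())
faulted = sorted(name for name, row in disks.items() if row[0] == "FAULTED")
print(total_writes, faulted)
```

For the listing above this reports 403 write errors in total, with c1t1d0 and c1t6d0 faulted.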


Nothing explained the turnaround from stable to useless for any writes. I also got some read errors, and no nightly rsync against this tree would survive without incrementing some error count. Was it somehow one of my cache settings on the Adaptec card that conflicted with a new version of the "aac" driver? I thought I would need to isolate it, revert perhaps, or consider that somehow my card was simply dying. Perhaps the cache/RAM on the card itself was toast.

A recent post on the opensolaris-discuss mailing list gave me an idea. A user there was suffering from repeated crashes that corrupted ZFS until the errors were cleared, and Mike DeMarco suggested trying zil_disable to test "if zfs write cache of many small files on large FS is causing the problems." That makes some sense if the card is somehow thrashing on small writes. Using the pool for backup means that it's being read and written to via rsync, which can involve many small updates. I also had various read errors pop up. So, after yet another degraded pool, I put the old faithful zil_disable and, for good measure, zfs_nocacheflush back. After a reboot and scrub, I let it do its nightly multi-terabyte delta rsyncs. After a few days, there are no errors. Have I stumbled onto some code path bug that is ameliorated by these kernel options? Do newer kernels have suspect aac drivers?

Perhaps someone will prove the logic of the above all wrong, but for now, I'm returning to the old standby "unsafe" kernel options to keep my pool stable.
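
Since a re-install silently dropped these settings, it may be worth checking for them after each upgrade. Here is a minimal sketch of such a check. It assumes the conventional "set module:variable = value" form and "*" comment lines used in /etc/system; zfs_tunables is a hypothetical helper and the example text is illustrative, not copied from my appliance.

```python
# Sketch: report which ZFS tunables an /etc/system-style file sets.
# Assumes "set module:variable = value" lines and "*" comment lines.


def zfs_tunables(text):
    """Return {variable: value} for every 'set zfs:<variable> = <value>' line."""
    found = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("*"):
            continue  # blank line or comment
        parts = line.split(None, 1)
        if len(parts) != 2 or parts[0] != "set":
            continue  # some other directive (forceload, exclude, ...)
        name, sep, value = parts[1].partition("=")
        if not sep:
            continue
        module, _, variable = name.strip().partition(":")
        if module == "zfs" and variable:
            found[variable] = value.strip()
    return found


# Illustrative contents only.
EXAMPLE = """\
* ZFS workarounds carried over from the old install
set zfs:zil_disable = 1
set zfs:zfs_nocacheflush = 1
"""

settings = zfs_tunables(EXAMPLE)
for wanted in ("zil_disable", "zfs_nocacheflush"):
    print(wanted, settings.get(wanted, "not set"))
```

On a live system you would pass in the contents of /etc/system instead of the EXAMPLE string.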


Thursday, April 03, 2008

OpenDirectory upgrade path from 10.4 to 10.5

In EE we've migrated over from various AD and OpenLDAP installations to what we hope is a more manageable solution long term. Sadly, upgrading OpenDirectory (MacOSX OpenLDAP-based directory services) from 10.4 to 10.5 doesn't work as Apple states it would. Here's the complete recipe we used to keep our data, our passwords, and most importantly, our domain SID. Apple tends to not care about maintaining the SID in various replica-to-master promotion steps.

First, a reference to the cookbook for doing things the hard way.

As recommended in the above and from other postings, upgrades do not work. Rather, what needs to be done is this:

10.4 Server:

1) go to Server Admin, OpenDirectory, and under the Archive tab, generate an archive of the OpenDirectory DB. Place in admin home directory
2) For safe keeping, go to /var/db/samba and get the secrets.tdb file. Place in admin home directory (readable by all)
3) get the current SID by running as root/sudo "net getdomainsid EE" where EE is the domain we are supporting. Place in home directory
4) copy off to a 3rd party machine the above three files/directories

10.5 Server:

1) Install fresh, and use the exact same IP and name as the 10.4 Server. You'll likely need to have these on their own net. Also note that without a link on the primary interface, smb, dns, and opendirectory don't work. I suggest connecting to the third party machine listed above, in my case my laptop's physical connection which I assign to the private net
2) You'll need DNS set up temporarily, so create a DNS server for your domain (stanford.edu) and create a host entry for yourself. Point local network settings to self as DNS server
3) copy over the files saved from 10.4 from the laptop/3rd party machine
4) Make an OpenDirectory Master, using the correct domain "dc=ee,dc=stanford,dc=edu" and correct KRB realm "EE.STANFORD.EDU"
5) import the archive of 10.4
6) run as root "mkpassdb -kerberize"
7) Create a new PDC config for Windows. Use the directoryadmin account/password to give samba correct access to the OpenDirectory DB
8) edit /var/db/smb.conf to fit the /etc/smb.conf entries you had on 10.4. Likely you'll want to make "local path = " and add "admin users = directoryadmin, domainjoin, @admin" or the like, where the first is the directory admin acct, the second is a PDC join account that can't login, but has directory admin rights. @admin works to include anyone in admin group
9) run as root "chflags uchg /var/db/smb.conf" to freeze your samba config. Recommend making a copy as well in the same dir.
10) run as root "net setdomainsid (SID)" where SID is the one you saved from 10.4
11) Go into Workgroup Manager. Change preferences to enable Inspector. Go into Inspector and select "Config" and then "CIFSServer". The two Value lines with "xml version.." need to have Edit run against them, and replace the SID line in each with the SID you just used.
12) restart Samba/Windows services. Check SID with, as root, "net getdomainsid" and "net getlocalsid EE" or the like. If anything didn't stick, do 10, 11 again.
13) before going live, one needs to remove the reference to the local DNS in Network preferences, and optionally disable the DNS service. This setup was also only tested with the WINS service enabled as the WINS Server
14) test, test, test from Windows including domain logins, enumeration of groups in windows for adding domain users, etc. Logs may show if accounts are failing.

On Windows, the simple tests you can do involve the utility "nltest" which is in the free SUPPORT TOOLS (but may not be installed by default). nltest /? gives commands although OS-X samba only supports some of them.

..to list PDC and BDCs --- nltest /dclist:your_domain

nltest /dclist:ee
Domain 'ee' is pre Windows 2000 domain. (Using NetServerEnum).
List of DCs in Domain ee
\\EE-OD (PDC)
The command completed successfully

..to verify schannel --- nltest /sc_query:your_domain
C:\>nltest /sc_query:ee
Flags: 0
Trusted DC Name \\EE-OD
Trusted DC Connection Status Status = 0 0x0 NERR_Success
The command completed successfully

To do a more detailed check, you can open the Windows Manager and try to look at the members of the Administrator group for the machine. When we had trouble, it just showed raw SID numbers, even for EE\DomAdmins. Once it was fixed, then that showed correctly.

Error cheat sheet:

1. If smb logs show that directoryadmin or domainjoin and the like have the "wrong sid" in passdb, you'll need to demote/promote Windows Services to workgroup and back to PDC. You'll need to run "chflags nouchg /var/db/smb.conf" first, and copy back your saved version after repromotion, as the file will be rewritten. Then repeat steps 9-12 above.

2. If kerberos isn't effectively working on clients, you may need to reimport the OpenDirectory archive, rerun "mkpassdb -kerberize" and follow the above demote/promote steps.


Have NAS, Want Shell

Now that anyone can grab Nexenta's NAS product, many will undoubtedly want to get under the hood, especially developers. First, a fair warning that although the management infrastructure is resilient to many changes done manually, modifying various service configurations outside of Nexenta's internal version control can lead to one or two headaches if you aren't careful. That said, give me a shell!

Well, that's simple. When you log in via the console (ssh, for example), simply run "setup appliance nmc edit-settings". You can tab your way through that command as well. Once there, edit expert_mode to be "1". Yes, you've entered the "vi" command zone, so save and exit with ':wq'.

Once the changes are saved, you'll be asked to refresh the console settings, and now you can type "!bash" to get a nice usable shell, or bang escape any command. You'll be root, so be aware and behave! Now you know what Nexenta Core was all about, as it's all there at your fingertips, along with the NMS, NMC, and NMV subsystems that are the foundation of the NAS product.

update:
I was told that an alternative way to set expert mode is
option expert_mode = 1 -s
as denoted in the "option -h" documentation for NMC. The "-s" flag updates the on-disk configuration.


Developers, developers, developers...

Ever wanted that NAS on your own hardware, for free? Nexenta has finally released their NexentaStor Developer Edition 1.0, which is a free version of their commercial product with only a 1TB limit on used storage. All other functionality is there, unlimited. This is a near final release for the commercial version, but is the first version the general public can get and install on their own hardware.

The release represents many things, but the Developer releases are focused on more than just tire kicking or a free NAS product for your home NAS needs. Rather, there is a lot of potential to extend and use Nexenta's SA-API for storage service-enabled solutions. Wish to modify your DB to wrap a transaction in a snapshot? Need to automate separate file system creation, quotas, etc for your users? Registered users on the web site can look at an overview of the architecture and sample SA-API components. I expect much more in the way of API details in the near future. However, the release of the product is here today.

A general support forum is also available.

There are two other automation aspects to NexentaStor that I haven't given much love to here. Both utilize the batch nature of NMC, the Nexenta Management Console. One is the 'query' functionality, which allows various introspections on the NAS and can query across multiple appliances at once if they are grouped together (the group function). In a similar vein, there is the NMC recording facility, which is handled by the "record" command. Recording allows you to save and play back actions for various tasks, including over a network of NAS devices. All of these commands have ready examples available by invoking the command with the "-h" help argument in the console. There is also good stuff in the User Guide which is available for download.


Friday, March 21, 2008

Step by Step CIFS Server setup with OpenSolaris

After CIFS Server was released into the OpenSolaris wild, I could not for the life of me get it to work. Even in the post-B82 stage, the random collection of documentation led me astray in multiple ways. I think part of the problem is that I read up on it too much, so requirements that were no longer accurate got in the way. You need to set up your krb5.conf file, right? LDAP too? The final resolution appears to be rather straightforward, and it likely shows that steps I had taken previously were left rotting on my system and prevented a working solution.

So, what do you actually need? I'd recommend starting with at least B85. In my case I used the latest NexentaOS unstable release (1.0.1 to be) which includes B85 and by default the necessary Sun smb packages. For my test, I created a contrived domain using Windows 2003 Server (SP2) called WIN.NEXENTA.ORG. The rest follows:

add to /etc/resolv.conf:
nameserver 172.24.101.71
domain win.nexenta.org
search win.nexenta.org
(Nameserver is our AD DNS server)

(optional: run ntpdate against your time server)
#svcadm enable svc:/network/ntp:default
#svcadm enable -r smb/server
#smbadm join -u Administrator win.nexenta.org

#zfs set sharesmb=on data/myshare
#zfs set sharesmb=name=myshare data/myshare

#mkdir /data/myshare/jlittle
#chown jlittle /data/myshare/jlittle

#idmap add 'winuser:*' 'unixuser:*'
#idmap add 'wingroup:Domain Users' 'unixgroup:staff'

#svcadm restart smb/server
#svcadm restart idmap

Another advisable step is to use casesensitivity=mixed for correct behavior with Windows clients. Note that this property can only be set when the filesystem is created (for example "zfs create -o casesensitivity=mixed data/myshare"), and it is likely not ideal if the ZFS filesystem is also shared to NFS clients. You know it's all working if "idmap dump" gives you real values rather than just returning to the prompt. I connected to my new share from a MacOSX client, and made sure my domain was given as "win.nexenta.org" when connecting (aka smb://server/myshare/jlittle).

In the end, it was much simpler than the documents suggested. I had to avoid explicitly stating the domains in idmap to make idmap do the right thing. You should pick the right local group for your users in the mapping for groups. I picked "staff" as that was the default group of my user.


Monday, March 03, 2008

Random Storage Comments, Answered

In my last posting, a lot of comments covered wide and varied ground. First, it's important to note that even with CDP underlying ZFS pools, ZFS itself provides for its own integrity of state. If CDP didn't complete a transaction, a re-sync will generally resolve it, but the hosted ZFS filesystem itself need not fear, as its transactions won't be finished until the write is checksummed. I agree that there are failure modes here, but that leads to a good quote in one of the comments:

"To that end, it seems to be that whenever a choice can be had between doing something simple to accomplish a goal and chaining a bunch of parts together to accomplish the same goal with more sophistication, its likely the simpler solution will be more sustainable over time."

I concur. Nexenta marries two pieces of functionality to get auto-cdp, and they rely on the two components in whole to maintain overall simplicity of implementation. The real value that they have provided is in making the front end dead simple. If the management isn't simplified, any level of underlying functionality will be lost in the long run.

I want to focus more on the simplicity of the performant NAS solutions. With pNFS, Lustre, and the like, we know that the client becomes a bit less transparent, and the backend store of data definitely becomes somewhat opaque, as data is no longer consistent per server but is spread out across the whole back end. Even though you need newer clients with specific functionality in both cases, it can again be simpler than the alternative, which generally involves an NFS v3 client using automounts, LDAP-based automount maps, and heavy handed data management on the backend to scale out in similar ways over multiple NAS heads. The tack of taking a single high end head with best of breed backend hardware, such as IB interconnects to SAS disk arrays and 10Gb Ethernet out the front, might seem to work, but we have already seen pathological conditions where a single heavy client writing millions of small files can make that enhanced hardware meaningless for performance.


There is no fast answer to solving scale out for both capacity and performance without a little give on each aspect of the design. What makes it all reasonable to consider is whether the entire solution is made much simpler to manage than the alternatives at either end of the design spectrum. In the end, simplicity of manageability will trump other considerations. As long as simplicity is strictly maintained in the product, the underlying complexity will seem well worth the effort. We just need to trust that someone gets the fine details. After all, we don't mind that we can't muck much with a highly efficient, high performance car. As long as we feel mastery over its operation and trust in the quality of the build and service by the manufacturer, we all are willing to make the investment.


Thursday, February 07, 2008

The ZFS scaling and DR question

In my dealings with ZFS-based NAS and 2nd tier solutions, I've been blessed to hear from different people with thoughts that push the discussion forward. The ZFS space is where a lot of long term strategies that utilize commodity components are covered. Other spaces that I follow seem to be somewhat stale, or consider specialized point systems that solve problems between just two or three chosen vendors. ZFS is open, so I think its ecosystem can only grow. I'm happy to have permission from Wayne Wilson to re-state one of his emails, and use it as a discussion point.




I see two different paths to scale:

1) Use a single system to run NexentaStor and have it mount
the remaining systems via iSCSI.  This would probably have to
be done over 10 Gbs or infiniband links.

2) Create Multiple NexentaStor systems and only present them
as a unified file system via NFS V4.  This would leave the
CIFS clients restricted to mounting multiple volumes, but that
may be ok.


Is this it or is there some other way?

Next architectural issue is how to do DR and archiving.

There are two types of DR - Human initiated and machine initiated.

My standard strategy for the Human initiated DR is to use
snapshots and keep enough of them around to answer most restore
requests. For machine initiated, my worst case is when the
storage subsystem (either a complete Thumper or a complete array)
fails.  For this I can find no other solution than to replicate.

  As you have pointed out, replication usually locks you
into the backend storage vendors system, whereas it would be
better for us consumers to be able to 'mirror' across two
disparate storage backends.

Here is where things might fall apart in using Thumpers. We
could probably spec out a really high I/O throughput 'head end'
type server to load Nexenta on.  Then we could present any kind
of storage as iSCSI LUN's to the system. Let's say we use a
Thumper 48TB system to as an iSCSI target for our head end, then
we could use an Infortrends (or some such) SATA/iSCSI array for
other LUN's and let ZFS mirror across them.......and rely on
snapshots for human based DR and mirror failure protection for
machine based DR.


Then that leaves us with Archiving. I think that here is where
a time based tier, or at least the ability to define a tier based
on age would be useful.  If we set the age to a point beyond which
most changes are taking place (letting snapshot's take the brunt
of the change load before archiving), then it is likely that we
will have just one copy of the data to archive off, for most files.

What we would want to do is the make tape the archival tier. I am
uncertain as to how to do this.  Should it be done using vendor
backup software to allow catalogs and GUI retrievals?



Wayne covers how to scale out NAS heads based on ZFS, as well as the standard DR and archive question.

Considering the NAS head scenario, it is plain that long term, running storage across multiple NAS heads using upcoming technologies as they mature, such as Lustre or pNFS, will be necessary to scale out with both performance and capacity. However, it is reasonable to consider solutions that utilize DAS/IB/external SAS/iSCSI to take one head node and approach petabyte levels. I consider this within reason if the target is second tier or digital archive storage, where performance isn't king. With the best of hardware, perhaps the single head node (or HA config) will have sufficient performance for most primary storage deployments. Time will tell as our needs require such solutions and technologies we have employed improve.

Disaster recovery I brought up in a recent post. File based backup solutions work well in the act of backing up, but restoration at a large scale is wanting, especially if file numbers become more dominant than file size. The next beta of NexentaStor happily has taken a large leap forward in addressing this by implementing a very simple to manage auto-cdp (continuous data protection) service across multiple NAS heads. This keeps multiple storage solutions in sync as data is committed, operates below the ZFS layer, and is bidirectional. Yes, the secondary system can import the same LUNs or ZFS pools and re-export them to your clients. Just as important, if you lose the primary host, synchronization can be reversed to restore your primary system at full block level speeds.

If you take this approach, and also consider exposing native zvols or NFS/CIFS to your clients (such as a mail server), they too can use their local DAS storage under any OS and filesystem, but they can use native backup solutions to the ZFS exported volumes to regularly backup block-level dumps to allow speedy block-level restores. A mixture with this and file level backups even permits less frequent full dumps and greater granularity in recovery. In the end, you'd hope to have these wonderful features on your server OS directly to prevent having to do DR, but you can see that we are approaching reasonable answers.

The final issue brought up is archival, and I hope my previous posts have gone far in answering it. In general, I believe disk based archival solutions need to be employed before tape is considered, and tape should be fully relegated to final archival stages only. Today, you can use multiple open (Amanda/Bacula) and closed backup software solutions to write to tape libraries from trailing edge snapshots. I also know that, though in their infancy, the NDMP client services evolving for ZFS will someday allow easier integration into current backup systems, allowing most people to convert existing tape based solutions completely into their last tier archive, running infrequently for long periods with just full backups.

All the above is just my "it's getting better" perspective. Perhaps you can find some glaring weakness. I hope shortly you can all see the auto-cdp service that Nexenta has put together in action. It's well worth the wait.

Thursday, January 17, 2008

Using the iRam: Improving ZFS perceived transaction latency

I've been long overdue in reviewing the Gigabyte iRam card and its effect on the performance of your favorite ZFS NAS product. NexentaStor already supports log devices, so the time appeared right to get one of these for a client I consult with, to help deal with the noticeable pauses one can see when heavy reads and writes compete on a ZFS pool. I hope that the single threaded nature of those commits is resolved at some future point, but the iRam card appears to be a simple way to inject an NVRAM-like device into your commodity NAS solution.

The card itself is simply four DIMM sockets for DDR RAM, with a battery backup, reset switch, power driven from a PCI bus, and a single SATA-I connection to plug the unit into your existing SATA interfaces. Already you can see that the performance limit is 150MB/sec based on the SATA-I spec. What does this card do though? Near instant reads and writes in a safe battery-backed ramdisk that your system sees as a 2GB or 4GB drive, just what you'd want for a dedicated write commit device. In the case of many spindles in an array, you likely can do better than this device for raw throughput, but in the case of many small commits, the near perfect latency of RAM is much better suited to keeping writes happening without stalling the drives for reads. Since it's a "slog" device in ZFS terms, it will regularly commit to the real underlying storage at full disk bandwidth. Therefore, even when writes must compete with reads on the physical disk, you limit your exposure to perceived stalls in I/O requests even in the higher load cases.
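
The 150MB/sec ceiling falls straight out of the SATA-I link parameters. As a quick worked example (using decimal megabytes, and treating the 2GB card size as 2000MB for the fill-time estimate, which is my own assumption):

```python
# SATA-I signals at 1.5 Gbit/s and uses 8b/10b encoding, so every
# 10 bits on the wire carry 8 bits (one byte) of payload.
LINE_RATE_BITS_PER_S = 1.5e9

payload_bits_per_s = LINE_RATE_BITS_PER_S * 8 / 10
payload_mb_per_s = payload_bits_per_s / 8 / 1e6  # decimal megabytes

# Assumption: a 2GB card treated as 2000 MB, written at the full link rate.
CARD_MB = 2000
seconds_to_fill = CARD_MB / payload_mb_per_s

print(f"link payload limit: {payload_mb_per_s:.0f} MB/s")
print(f"time to write the whole card: {seconds_to_fill:.1f} s")
```

In other words, even at the interface limit the whole device can be rewritten in well under a minute, which is why it works as a small, fast commit target rather than as bulk storage.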

For my non-production test, I actually put together the worst case scenario: an iSCSI backed ZFS array with NFS clients and many small files. In this case, any NFS writes require 3 fsyncs on the back end storage as required by NFS (create, modify, close). This is actually similar to CAD libraries, which the test was made to reflect. Using iSCSI devices, you can inflict much higher latencies. My iSCSI targets are actually older SATA-I drives themselves on a SBEi Linux based target using 3ware 8500s. Again, nowhere near ideal.

I created a directory of 5000 small 8k files and copied it from a Linux gig-e connected client to a ZFS pool (made of two non-striped iSCSI LUNs), and got a meager 200K/sec write performance over NFS. When I striped the data in the ZFS pool instead, the numbers increased to 600K/sec at some points. Adding a 2GB Gigabyte iRam drive, I increased those numbers up to 9MB/sec at peak, averaging around 5MB/sec overall. That's at least 10 times the performance. Again, this test is bound by the number of I/O operations rather than by bandwidth.
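
To put those rates in terms of wall-clock time for the test directory, here is a back-of-the-envelope calculation. It assumes decimal units (1MB = 1000KB), which the numbers above don't specify, and the labels are just my shorthand for the four measurements.

```python
# Assumption: decimal units (1 MB = 1000 KB).
FILES = 5000
FILE_KB = 8
total_kb = FILES * FILE_KB  # size of the whole test directory in KB

# Measured NFS write rates from the test, in KB/s.
rates_kb_s = {
    "non-striped": 200,
    "striped (peak)": 600,
    "iRam (average)": 5000,
    "iRam (peak)": 9000,
}

baseline = rates_kb_s["non-striped"]
seconds = {label: total_kb / rate for label, rate in rates_kb_s.items()}
speedup = {label: rate / baseline for label, rate in rates_kb_s.items()}

for label in rates_kb_s:
    print(f"{label:15s} {seconds[label]:6.1f} s  {speedup[label]:5.1f}x")
```

Under those assumptions the 40MB copy drops from about 200 seconds to about 8 seconds on average, a 25x gain over the non-striped baseline (and roughly 8x over the best striped number).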

How fast can data be written to and read from that log device? My tests showed that 100MB/sec for reads and writes were common, with writes only bursting to those numbers for larger streaming data sets. In the case of the iSCSI nodes in question, each one could be pulled at a top rate of 45MB/sec, but averaging closer to 27MB/sec. Nominally, you can see that we are 3x better than at least these gig-e iSCSI devices.

The final production installation of the iRam device was with a SATA-II DAS array, and even in heavier load scenarios, we saw the wait cycle for write commits to the drives limited, and a steady 100+MB/sec use of the commit log (reads and writes). The only caveat for using such a device is that the current builds of OpenSolaris and thus NexentaStor do not allow you to remove it once added to a pool. A future release is supposed to address that.


Friday, January 11, 2008

Swept Under the Rug

In our day to day management of technology, we tend to pick paths that resolve the most pressing pain points. Inadvertently, we often also sweep certain problems under the rug, awaiting the day when it all must be cleaned up. Many choices do exactly this, solving the present problem while creating perhaps larger problems down the road. In my evolving strategy on storage, the move away from tape to disk-based online storage solves the most obvious problems but creates a whole series of other problems, including file based disaster recovery, long term maintenance of the underlying disk technology, true long term persistence of data, and general accessibility of the data by future technology. Today, I'll focus only on our next major pain point, disaster recovery.

Recently, a few instances occurred that underscored the need for better thought out solutions than what we already put in place. We thought we might be ahead of the curve with tiered copies of data on secondary NAS solutions, with our backup windows well within reason. It's obvious we made the right choice in doing incremental file based backups to secondary NAS, as the end data containers are universal across network file protocols. Recovery of any given file or perhaps even full data store recovery still beats that of tape libraries multiple times over. However, the architecture in place has allowed us to scale from the gigabyte world to the terabyte world. Our backup windows are well in hand, and spot recovery is a cinch. But there are some problematic disaster recovery scenarios.

The first scenario was just felt a week ago. A mere 50GB file store of Maildir formatted mail, where each message is a file itself, with mail folders represented by directories, had write errors on its underlying Linux XFS volume. This is by far not our largest install of such. Various mail servers for separate organizations we deal with are over 500GB in size. We suspect the RAID card's NVRAM was toast, disallowing further writes, and we had to migrate the mail to another server quickly. Simple enough, let's recover from our second tier mail store, right? The attempt was made, but we found ourselves limited not by the reading of millions of small 1K files so much as recommitting those files onto a journaled filesystem. The metadata updates of the files alone were bad enough. In the end, we were limited by file operations per second, and not pure bandwidth to the disk. Our estimated time of recovery was a minimum of 14 hours, and only for 50GB. A clue to the long term solution to this was in how we restored everything in less than 2 hours. In this case, we relied upon an xfsdump from the read-only failing array to a new filesystem on the spare hardware.

The obvious up front answer to disaster recovery of data in a multi-terabyte world is to make sure you have copies of everything in as close to a high availability setup as you can afford. If the underlying RAID array was actually two arrays, with software mirroring across the two, or if it were two separate machines that either attached to shared mirrored arrays or otherwise mirror their underlying RAID arrays over the network, we'd all just worry about natural disasters. Preventing the true disaster recovery scenario up front is the only true way to win, but most of us simply don't have the luxury, the resources, or the ability to safely migrate the myriad of production or otherwise in use solutions over to the ideal configuration. We can all try to reach this nirvana, but it's simply not as attainable to most of us as we'd like.

We can, however, address some of the pain of the disaster recovery scenario from disk based solutions. The iSCSI and SAN vendors have been on this for some time, and have extolled the virtues of block based storage. Using such, you can stream I/O at near the theoretical limits of the hardware. However, running all your systems against a SAN throws you down the path of the usual hardware based solution to a general problem, with the usual vendor lock-in quibbles. We already have bought into the software based approach that NexentaStor has offered us, and happily, they already provide a similar solution to our needs. With thin-provisioning of virtualized storage volumes (zvols), one can expose block level storage to clients but still treat them as snapshot capable files, use file level services and such on the back end second level NAS. The clients will generally access these through iSCSI, and they can either directly depend on these network-based volumes as if they were local filesystems, or simply use their filesystem native dump programs to periodically maintain a near synchronized copy of a true DAS filesystem to a second tier block level copy. The latter is nice as it doesn't place undue strain on the back end storage architecture to service all clients in parallel at the fullest of performance for production. We just use network and storage resources for backup.

What does this solve? In the case of disaster recovery, the backup process can be run in reverse, getting streaming I/O rates, perhaps as high as 100+MB/sec over gigabit ethernet, when the local arrays fail. In the case of my mail spool filesystem, we recovered at rates of 25-30MB/sec instead of the 500-800K/sec we saw. Even if it's not the most up to date copy, if one also did file-level backup of the underlying file system to NexentaStor or the like at a faster interval, you can recover from that incrementally after the first block level recovery. Either way, you taste the sweetness of success. Again, the dirt is stuffed under the rug if this is hundreds of terabytes, and some day soon that may also be just as commonplace, but perhaps we are again ahead of the curve. The one side effect is that you'll want more cheap storage readily available on the second tier.
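
The difference those rates make is easier to see as recovery time. A rough estimate for the 50GB mail store follows, assuming decimal units and a sustained rate for the whole transfer (real restores won't be that steady); hours_to_copy is just an illustrative helper.

```python
# Rough recovery-time estimate; assumes decimal units and a steady rate.


def hours_to_copy(size_gb, rate_mb_per_s):
    """Hours needed to move size_gb at a sustained rate_mb_per_s."""
    return size_gb * 1000 / rate_mb_per_s / 3600


SIZE_GB = 50  # the Maildir store from the first scenario

# File-level restore rates we saw (500-800K/sec) vs block-level (25-30MB/sec).
file_level_hours = [hours_to_copy(SIZE_GB, r) for r in (0.5, 0.8)]
block_level_hours = [hours_to_copy(SIZE_GB, r) for r in (25, 30)]

print("file level : %.1f to %.1f hours" % (file_level_hours[1], file_level_hours[0]))
print("block level: %.0f to %.0f minutes"
      % (block_level_hours[1] * 60, block_level_hours[0] * 60))
```

That works out to somewhere between 17 and 28 hours at the file-level rates, against roughly half an hour of pure transfer time at block-level rates.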

I'll quickly describe the second scenario, where a failing system also needed its 1TB of mainly larger files migrated. We saw that our top rates of file level recovery at best were 1GB/minute, but generally less. Again, it would have made sense to have been redundant up front, but the same solution above could more than double the rate of recovery if we could restore the primary file system at block speeds. This is similar to how virtual machines are managed from SAN, iSCSI, or even NFS. The VMs themselves are represented as files, and so operations on these files approach the maximum speed of block storage operations. However, having them on that NAS allows ease of sharing and management, including snapshots. No hardware tricks, all software. We haven't addressed the next stumbling blocks, which include kernel page size limitations on true file I/O, but the dirt is nicely hidden for the time being.
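
The same arithmetic applies to the 1TB case. This sketch assumes decimal units (1TB = 1000GB), takes "more than double" as exactly double for a conservative figure, and uses the 100MB/sec gigabit figure mentioned earlier as an optimistic ceiling.

```python
# Assumption: 1 TB = 1000 GB, steady rates throughout.
SIZE_GB = 1000
FILE_LEVEL_GB_PER_MIN = 1.0  # our best observed file-level recovery rate

file_level_hours = SIZE_GB / FILE_LEVEL_GB_PER_MIN / 60
doubled_hours = SIZE_GB / (2 * FILE_LEVEL_GB_PER_MIN) / 60

# Optimistic ceiling: 100 MB/s of streaming block I/O over gigabit ethernet.
GIGE_MB_PER_S = 100
gige_hours = SIZE_GB * 1000 / GIGE_MB_PER_S / 3600

print(f"file level, best case : {file_level_hours:.1f} h")
print(f"doubled (block level) : {doubled_hours:.1f} h")
print(f"gigabit streaming     : {gige_hours:.1f} h")
```

So the best-case file-level restore is close to 17 hours, doubling the rate brings it to a bit over 8, and a fully streaming restore over gigabit would be under 3.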
