Friday, January 11, 2008

Swept Under the Rug

In our day-to-day management of technology, we tend to pick paths that resolve the most pressing pain points. In doing so, we often also sweep certain problems under the rug, awaiting the day when it all must be cleaned up. Many choices do exactly this, solving the present problem while creating perhaps larger problems down the road. In my evolving strategy on storage, the move away from tape to disk-based online storage solves the most obvious problems but creates a whole series of others: file-based disaster recovery, long-term maintenance of the underlying disk technology, true long-term persistence of data, and general accessibility of the data by future technology. Today, I'll focus only on our next major pain point: disaster recovery.

Recently, a few incidents underscored the need for better-thought-out solutions than what we already had in place. We thought we might be ahead of the curve with tiered copies of data on secondary NAS appliances and backup windows well within reason. It's obvious we made the right choice in doing incremental file-based backups to secondary NAS, as the resulting data containers are universal across network file protocols. Recovery of any given file, or perhaps even of a full data store, still beats tape libraries many times over. The architecture in place has allowed us to scale from the gigabyte world to the terabyte world, our backup windows are well in hand, and spot recovery is a cinch. But some disaster recovery scenarios remain problematic.

The first scenario hit just a week ago. A mere 50GB file store of Maildir-formatted mail, where each message is a file and mail folders are directories, developed write errors on its underlying Linux XFS volume. This is far from our largest such installation; various mail servers for separate organizations we deal with exceed 500GB. We suspect the RAID card's NVRAM was toast, disallowing further writes, and we had to migrate the mail to another server quickly. Simple enough: recover from our second-tier mail store, right? The attempt was made, but we found ourselves limited not so much by reading millions of small 1K files as by recommitting those files onto a journaled filesystem. The metadata updates alone were bad enough. In the end, we were limited by file operations per second, not by raw bandwidth to the disk. Our estimated time of recovery was a minimum of 14 hours, and that for only 50GB. A clue to the long-term solution lay in how we actually restored everything in under 2 hours: an xfsdump from the read-only failing array to a new filesystem on the spare hardware.
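The arithmetic behind that 14-hour estimate is worth spelling out. The sketch below is a back-of-envelope model, not a measurement: the ~1000 file creates per second on a journaled filesystem is my assumed figure, but it shows how file-count, not size, dominates a Maildir restore.

```python
# Back-of-envelope: why a Maildir restore is bound by file operations
# per second while a dump/restore is bound by bandwidth.
# The 1000 creates/sec figure is an assumption for illustration.

def file_level_restore_hours(total_bytes, avg_file_bytes, ops_per_sec):
    """Hours to recreate every file when metadata ops dominate."""
    return total_bytes / avg_file_bytes / ops_per_sec / 3600

def streaming_restore_hours(total_bytes, mb_per_sec):
    """Hours to stream the same data as one sequential dump."""
    return total_bytes / (mb_per_sec * 2**20) / 3600

GB = 2**30
# 50GB of ~1KB messages recreated one file at a time
print(round(file_level_restore_hours(50 * GB, 1024, 1000), 1))  # ≈ 14.6 hours
# the same 50GB streamed via xfsdump/xfsrestore at 25MB/sec
print(round(streaming_restore_hours(50 * GB, 25), 1))           # ≈ 0.6 hours
```

Fifty million tiny files at even a thousand creates per second is a half-day job no matter how fast the disks are; the same bytes as one sequential stream finish in well under an hour.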

The obvious up-front answer to disaster recovery of data in a multi-terabyte world is to keep copies of everything in as close to a high-availability setup as you can afford. If the underlying RAID array were actually two arrays with software mirroring across them, or two separate machines that either attached to shared mirrored arrays or mirrored their underlying RAID arrays over the network, we'd be left worrying only about natural disasters. Preventing the true disaster recovery scenario up front is the only sure way to win, but most of us simply don't have the luxury, the resources, or the ability to safely migrate the myriad production solutions already in use over to the ideal configuration. We can all strive for this nirvana, but it's simply not as attainable as we'd like.

We can, however, address some of the pain of disk-based disaster recovery. The iSCSI and SAN vendors have been on this for some time and have extolled the virtues of block-based storage; with it, you can stream I/O at near the theoretical limits of the hardware. However, running all your systems against a SAN throws you down the path of the usual hardware-based solution to a general problem, with the usual vendor lock-in quibbles. We have already bought into the software-based approach NexentaStor offers, and happily, it already provides a solution to our needs. With thin provisioning of virtualized storage volumes (zvols), one can expose block-level storage to clients yet still treat the volumes as snapshot-capable files and apply file-level services to them on the back-end, second-tier NAS. Clients will generally access these through iSCSI, and they can either depend directly on these network-based volumes as if they were local filesystems, or simply use their filesystems' native dump programs to periodically maintain a near-synchronized copy of a true DAS filesystem on a second-tier, block-level copy. The latter is nice because it doesn't strain the back-end storage architecture into servicing all clients in parallel at full production performance; we use network and storage resources only for backup.
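That "native dump to a second-tier block copy" routine could be driven by a small wrapper like the sketch below. The paths, labels, and level rotation are hypothetical; xfsdump's `-l` (dump level), `-L`/`-M` (session/media labels), and `-f` (destination) options are real, and levels 1-9 capture only changes since the last lower-level dump.

```python
# Sketch of scheduling incremental xfsdump runs against an
# iSCSI-attached zvol presented as a local block device.
# Device path and label scheme are hypothetical examples.

def xfsdump_cmd(level, filesystem, target, label="nightly"):
    """Build an xfsdump command line: level 0 is a full dump,
    levels 1-9 dump only changes since the last lower-level dump."""
    return ["xfsdump", "-l", str(level),
            "-L", f"{label}-l{level}", "-M", f"{label}-l{level}",
            "-f", target, filesystem]

# a weekly full dump plus nightly incrementals of the mail store
full = xfsdump_cmd(0, "/var/mail", "/dev/iscsi/backup0", label="weekly")
nightly = xfsdump_cmd(1, "/var/mail", "/dev/iscsi/backup0")
# subprocess.run(full, check=True)  # uncomment on a real system
```

Because the dump streams sequentially to the block target, the back-end NAS sees one large write rather than millions of small file operations, which is exactly the load it handles best.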

What does this solve? In a disaster, the backup process can be run in reverse at streaming I/O rates, perhaps as high as 100+MB/sec over gigabit ethernet, when the local arrays fail. In the case of my mail spool filesystem, we recovered at 25-30MB/sec instead of the 500-800K/sec we first saw. Even if the block-level copy is not the most up to date, if you also do file-level backups of the underlying filesystem to NexentaStor or the like at a shorter interval, you can recover from those incrementally after the first block-level recovery. Either way, you taste the sweetness of success. Again, the dirt is stuffed under the rug if this is hundreds of terabytes, and some day soon that may be just as commonplace, but perhaps we are again ahead of the curve. The one side effect is that you'll want more cheap storage readily available on the second tier.
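The two-phase recovery pays off even with an incremental catch-up pass tacked on. The rates below are the ones from our incident; the 512MB delta for the file-level pass is a made-up figure for illustration.

```python
# Compare a pure file-level restore against a block-level restore
# followed by a small file-level catch-up of newer messages.
# Rates are from the incident; the 512MB delta size is assumed.

def hours(size_mb, rate_mb_per_sec):
    return size_mb / rate_mb_per_sec / 3600

file_only = hours(50 * 1024, 0.8)        # whole 50GB store at ~800K/sec
two_phase = (hours(50 * 1024, 25)        # block-level restore at 25MB/sec
             + hours(512, 0.8))          # then ~512MB of newer files

print(round(file_only, 1))   # ≈ 17.8 hours
print(round(two_phase, 1))   # ≈ 0.7 hours
```

Even a generous file-level delta barely dents the total, because only the changed fraction pays the per-file penalty.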

I'll quickly describe the second scenario, in which a failing system needed its 1TB of mainly larger files migrated. Our top rates of file-level recovery were at best 1GB/minute, and generally less. Again, it would have made sense to be redundant up front, but the same solution above could more than double the rate of recovery by restoring the primary filesystem at block speeds. This is similar to how virtual machines are managed from SAN, iSCSI, or even NFS: the VMs themselves are represented as files, so operations on those files approach the maximum speed of block storage operations, while keeping them on a NAS allows easy sharing and management, including snapshots. No hardware tricks, all software. We haven't addressed the next stumbling blocks, which include kernel page size limitations on true file I/O, but the dirt is nicely hidden for the time being.
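For the 1TB case, the same arithmetic applies at larger scale. The 1GB/minute figure is our observed best; the ~100MB/sec block rate is an assumption for a well-behaved gigabit link, not a measurement.

```python
# The 1TB migration: observed file-level rate (~1GB/minute at best)
# versus an assumed block-level stream near gigabit wire speed.

def file_level_hours(size_gb, gb_per_min):
    return size_gb / gb_per_min / 60

def block_level_hours(size_gb, mb_per_sec):
    return size_gb * 1024 / mb_per_sec / 3600

print(round(file_level_hours(1024, 1.0), 1))   # ≈ 17.1 hours
print(round(block_level_hours(1024, 100), 1))  # ≈ 2.9 hours
```

Under these assumptions the block-level path is several times faster, comfortably clearing the "more than double" bar even if the link never reaches full wire speed.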
