Thursday, February 07, 2008

The ZFS scaling and DR question

In my dealings with ZFS-based NAS and second-tier storage solutions, I've been blessed to hear from different people whose thoughts push the discussion forward. The ZFS space is where many long-term strategies built on commodity components are being worked out. Other spaces that I follow seem somewhat stale, or consider specialized point systems that solve problems between just two or three chosen vendors. ZFS is open, so I think its ecosystem can only grow. I'm happy to have permission from Wayne Wilson to re-state one of his emails and use it as a discussion point.

I see two different paths to scale:

1) Use a single system to run NexentaStor and have it mount
the remaining systems via iSCSI.  This would probably have to
be done over 10 Gb/s Ethernet or InfiniBand links.

2) Create multiple NexentaStor systems and present them as a
unified file system only via NFSv4.  This would leave the
CIFS clients restricted to mounting multiple volumes, but that
may be ok.

Is this it or is there some other way?

Next architectural issue is how to do DR and archiving.

There are two types of DR - human-initiated and machine-initiated.

My standard strategy for human-initiated DR is to use
snapshots and keep enough of them around to answer most restore
requests. For machine-initiated DR, my worst case is when the
storage subsystem (either a complete Thumper or a complete array)
fails.  For this I can find no solution other than to replicate.

As you have pointed out, replication usually locks you
into the backend storage vendor's system, whereas it would be
better for us consumers to be able to 'mirror' across two
disparate storage backends.

Here is where things might fall apart in using Thumpers. We
could probably spec out a really high-I/O-throughput 'head end'
type server to load Nexenta on.  Then we could present any kind
of storage as iSCSI LUNs to the system. Say we use a 48 TB
Thumper as an iSCSI target for our head end; then we could use
an Infortrend (or some such) SATA/iSCSI array for other LUNs
and let ZFS mirror across them... and rely on snapshots for
human-based DR and mirror failure protection for machine-based
DR.

Then that leaves us with archiving. I think this is where
a time-based tier, or at least the ability to define a tier based
on age, would be useful.  If we set the age to a point beyond which
most changes have taken place (letting snapshots take the brunt
of the change load before archiving), then it is likely that, for
most files, we will have just one copy of the data to archive off.

What we would want to do is make tape the archival tier. I am
uncertain how to do this.  Should it be done using vendor
backup software to allow catalogs and GUI retrievals?

Wayne covers how to scale out ZFS-based NAS heads, as well as the standard DR and archiving questions.

Considering the NAS head scenario, it is plain that, long term, running storage across multiple NAS heads with upcoming technologies such as Lustre or pNFS as they mature will be necessary to scale out in both performance and capacity. However, it is reasonable to consider solutions that use DAS/IB/external SAS/iSCSI to take one head node and approach petabyte levels. I consider this within reason if the target is second-tier or digital archive storage, where performance isn't king. With the best of hardware, perhaps a single head node (or HA configuration) will have sufficient performance for most primary storage deployments. Time will tell, as our needs grow and the technologies we employ improve.
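Wayne's first path, concretely, amounts to one ZFS head importing everything else as iSCSI LUNs. A minimal sketch, assuming an OpenSolaris-style iscsiadm initiator and invented addresses and device names; the mirror layout also covers his vendor-mixing idea, pairing each Thumper LUN with a LUN from a different backend:

```sh
# On the head node: discover the iSCSI targets exported by the
# back-end boxes (addresses are placeholders).
iscsiadm add discovery-address 192.168.10.20:3260   # Thumper
iscsiadm add discovery-address 192.168.10.30:3260   # third-party SATA/iSCSI array
iscsiadm modify discovery --sendtargets enable

# Build one pool across the imported LUNs, mirroring each Thumper LUN
# against a LUN from the other vendor so that a whole-backend failure
# leaves the pool intact (device names are placeholders).
zpool create tank \
  mirror c2t1d0 c3t1d0 \
  mirror c2t2d0 c3t2d0
zpool status tank
```

ZFS then resilvers and scrubs across the two backends just as it would across local disks.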

Disaster recovery I brought up in a recent post. File-based backup solutions work well for the act of backing up, but restoration at large scale is found wanting, especially when file counts become more dominant than file sizes. The next beta of NexentaStor has happily taken a large leap forward in addressing this by implementing a very simple-to-manage auto-cdp (continuous data protection) service across multiple NAS heads. This keeps multiple storage systems in sync as data is committed, operates below the ZFS layer, and is bidirectional. Yes, the secondary system can import the same LUNs or ZFS pools and re-export them to your clients. Just as important, if you lose the primary host, synchronization can be reversed to restore your primary system at full block-level speeds.
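The auto-cdp service itself runs below ZFS, so I won't guess at its commands here; but the same sync-and-reverse pattern can be approximated at the ZFS layer with incremental send/receive. A hedged sketch with made-up host and dataset names:

```sh
# Initial full replication from the primary head to the DR head.
zfs snapshot tank/data@sync1
zfs send tank/data@sync1 | ssh dr-head zfs recv -F tank/data

# Periodic incremental catch-up, sending only the blocks changed
# since the previous synchronization snapshot.
zfs snapshot tank/data@sync2
zfs send -i tank/data@sync1 tank/data@sync2 | ssh dr-head zfs recv tank/data

# After a primary failure: reverse direction and rebuild the
# primary at block-level speed from the DR copy.
ssh dr-head zfs send tank/data@sync2 | zfs recv -F tank/data
```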

If you take this approach, and also expose native zvols or NFS/CIFS shares to your clients (such as a mail server), those clients can keep using their local DAS storage under any OS and filesystem, while using native backup tools to write block-level dumps to the ZFS-exported volumes regularly, allowing speedy block-level restores. Mixing this with file-level backups even permits less frequent full dumps and greater granularity in recovery. In the end, you'd hope to have these wonderful features on your server OS directly to prevent having to do DR at all, but you can see that we are approaching reasonable answers.
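The block-level dump-and-restore round trip that makes this attractive can be simulated with plain dd on scratch files; everything below is fabricated for illustration, with the two temp files standing in for a client disk and a dump file on a ZFS-exported share:

```shell
# Scratch files standing in for a client block device and the
# dump file it writes to an NFS/CIFS share on the ZFS head.
src=$(mktemp)
dst=$(mktemp)

# Fill the "client disk" with 64 KB of random data.
dd if=/dev/urandom of="$src" bs=1024 count=64 2>/dev/null

# "Backup": a raw block-level copy, as dump/dd would produce.
dd if="$src" of="$dst" bs=1024 2>/dev/null

# "Restore" verification: the copy must be bit-identical.
cmp -s "$src" "$dst" && echo "block-level copy verified"
```

This prints `block-level copy verified`; the real win, of course, is that restoring one big dump stream avoids walking millions of individual files.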

The final issue brought up is archival, and I hope my previous posts have gone far in answering it. In general, I believe disk-based archival solutions need to be employed before tape is considered, and tape should be relegated to the final archival stage only. Today, you can use multiple open (Amanda/Bacula) and closed backup software packages to write to tape libraries from trailing-edge snapshots. I also know that, though in its infancy, the NDMP client service evolving for ZFS will someday allow easier integration into current backup systems, letting most people convert existing tape-based solutions entirely into their last-tier archive, running infrequently for long periods with just full backups.
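Writing from a trailing-edge snapshot to tape can be as simple as streaming it, even before NDMP matures. A sketch with invented pool, snapshot, and Solaris tape-device names:

```sh
# Full-stream approach: serialize the oldest snapshot straight to
# the no-rewind tape device.
zfs send tank/archive@2007-12 | dd of=/dev/rmt/0n bs=1048576

# File-level alternative: tar out of the snapshot's hidden .zfs
# directory, which Amanda/Bacula-style tools can also walk to
# build catalogs and offer GUI retrievals.
tar cf /dev/rmt/0n /tank/archive/.zfs/snapshot/2007-12
```

The trade-off: a `zfs send` stream restores only as a whole dataset, while tar from `.zfs/snapshot` preserves per-file retrieval.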

All the above is just my "it's getting better" perspective. Perhaps you can find some glaring weakness. I hope shortly you can all see the auto-cdp service that Nexenta has put together in action. It's well worth the wait.