Thursday, February 07, 2008

The ZFS scaling and DR question

In my dealings with ZFS-based NAS and second-tier storage solutions, I've been blessed to hear from different people whose thoughts push the discussion forward. The ZFS space is where many of the long-term strategies built on commodity components play out. Other spaces I follow seem somewhat stale, or consider specialized point systems that solve problems between just two or three chosen vendors. ZFS is open, so I think its ecosystem can only grow. I'm happy to have Wayne Wilson's permission to reprint one of his emails and use it as a discussion point.

I see two different paths to scale:

1) Use a single system to run NexentaStor and have it mount
the remaining systems via iSCSI.  This would probably have to
be done over 10 Gb/s Ethernet or InfiniBand links.

2) Create multiple NexentaStor systems and only present them
as a unified file system via NFS v4.  This would leave the
CIFS clients restricted to mounting multiple volumes, but that
may be ok.

Is this it or is there some other way?

Next architectural issue is how to do DR and archiving.

There are two types of DR - Human initiated and machine initiated.

My standard strategy for the Human initiated DR is to use
snapshots and keep enough of them around to answer most restore
requests. For machine initiated, my worst case is when the
storage subsystem (either a complete Thumper or a complete array)
fails.  For this I can find no other solution than to replicate.

  As you have pointed out, replication usually locks you
into the backend storage vendor's system, whereas it would be
better for us consumers to be able to 'mirror' across two
disparate storage backends.

Here is where things might fall apart in using Thumpers. We
could probably spec out a really high I/O throughput 'head end'
type server to load Nexenta on.  Then we could present any kind
of storage as iSCSI LUNs to the system. Let's say we use a
Thumper 48TB system as an iSCSI target for our head end; then
we could use an Infortrend (or some such) SATA/iSCSI array for
other LUNs and let ZFS mirror across them... and rely on
snapshots for human based DR and mirror failure protection for
machine based DR.

Then that leaves us with Archiving. I think that here is where
a time based tier, or at least the ability to define a tier based
on age would be useful.  If we set the age to a point beyond which
most changes are taking place (letting snapshot's take the brunt
of the change load before archiving), then it is likely that we
will have just one copy of the data to archive off, for most files.

What we would want to do is to make tape the archival tier. I am
uncertain as to how to do this.  Should it be done using vendor
backup software to allow catalogs and GUI retrievals?

Wayne covers how to scale out NAS heads based on ZFS, as well as the standard DR and archive question.

Considering the NAS head scenario, it is plain that, long term, scaling out both performance and capacity will require running storage across multiple NAS heads using upcoming technologies, such as Lustre or pNFS, as they mature. However, it is reasonable to consider solutions that use DAS, InfiniBand, external SAS, or iSCSI to take one head node toward petabyte levels. I consider this within reason if the target is second-tier or digital archive storage, where performance isn't king. With the best of hardware, perhaps a single head node (or an HA pair) will have sufficient performance for most primary storage deployments. Time will tell, as our needs grow into such solutions and the technologies we have employed improve.
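As a concrete sketch of Wayne's path 1, the single head node would discover the remote iSCSI targets and build one pool across the imported LUNs. Everything here is illustrative: the discovery address, device names, and pool layout are assumptions, and the `iscsiadm`/`zpool` invocations are shown commented out since they require live targets; only a small helper that assembles the pool-creation command actually runs.

```shell
# Hypothetical sketch of path 1: one NexentaStor head pooling iSCSI-backed LUNs.
# The target address and device names below are made up for illustration.
#
#   iscsiadm add discovery-address 10.0.0.11:3260   # point at the Thumper target
#   iscsiadm modify discovery --sendtargets enable  # LUNs then appear as local disks
#
# Assemble the pool-creation command over whatever devices the LUNs show up as:
make_pool_cmd() {
  pool="$1"; shift
  echo "zpool create $pool mirror $*"
}
make_pool_cmd tank c2t0d0 c3t0d0
```

Mirroring across LUNs from two different backends (a Thumper and a third-party array) is exactly what lets ZFS, rather than a vendor replication product, provide the redundancy.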

Disaster recovery I brought up in a recent post. File-based backup solutions work well for the act of backing up, but restoration at large scale is wanting, especially when file counts matter more than file sizes. The next beta of NexentaStor happily takes a large leap forward in addressing this by implementing a very simple-to-manage auto-cdp (continuous data protection) service across multiple NAS heads. This keeps multiple storage systems in sync as data is committed, operates below the ZFS layer, and is bidirectional. Yes, the secondary system can import the same LUNs or ZFS pools and re-export them to your clients. Just as important, if you lose the primary host, synchronization can be reversed to restore your primary system at full block-level speeds.
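To make the sync-then-reverse idea concrete: auto-cdp itself operates below the ZFS layer, but the same shape can be sketched at the ZFS layer with incremental `zfs send`/`zfs recv`, which is a stand-in here, not how auto-cdp is implemented. Host and dataset names are hypothetical, the real commands are commented out since they need live pools, and the helper only assembles the pipeline string.

```shell
# Hypothetical ZFS-layer analogue of sync-then-reverse replication.
# (auto-cdp works below ZFS at the block level; this sketch substitutes
# incremental zfs send/recv. Hosts and datasets are made up.)
#
#   zfs snapshot tank/data@rep1
#   zfs send -i @rep0 tank/data@rep1 | ssh backup zfs recv -F tank/data
#
replicate() {
  src="$1"; dst_host="$2"; prev="$3"; snap="$4"
  echo "zfs send -i @$prev $src@$snap | ssh $dst_host zfs recv -F $src"
}
replicate tank/data backup rep0 rep1    # normal direction
replicate tank/data primary rep1 rep2   # reversed after a failover
```

The key property in both the real service and this sketch is that the direction of synchronization is just a parameter, so recovery of the primary runs at the same block-level speed as normal protection.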

If you take this approach, and also consider exposing native zvols or NFS/CIFS shares to your clients (such as a mail server), they can keep their local DAS storage under any OS and filesystem, yet use native backup tools against the ZFS-exported volumes to regularly write block-level dumps, allowing speedy block-level restores. Mixing this with file-level backups even permits less frequent full dumps and greater granularity in recovery. In the end, you'd hope to have these features in your server OS directly, to avoid needing DR at all, but you can see that we are approaching reasonable answers.
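A minimal sketch of that client-side routine, assuming a Solaris client with UFS on local DAS and an NFS-mounted ZFS share: the mount point, device path, and dump-naming scheme are all hypothetical, and the `mount`/`ufsdump` commands are commented out since they need real hardware; only the name-building helper runs.

```shell
# Hypothetical client routine: block-level dump of a local filesystem onto
# an NFS-mounted ZFS share; server-side snapshots add recovery granularity.
# Mount point, device, and dataset names are illustrative only.
#
#   mount nexenta:/export/backup /backup
#   ufsdump 0f /backup/mail-20080207.dump /dev/rdsk/c0t0d0s6
#
# Date-stamped dump names keep several block-level restore points around:
dump_name() { echo "/backup/$1-$2.dump"; }
dump_name mail 20080207
```

Because the share lives on ZFS, snapshots on the server side preserve older dumps at near-zero cost even when the client overwrites its file.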

The final issue brought up is archival, and I hope my previous posts have gone far in answering it. In general, I believe disk-based archival solutions should be employed before tape is considered, and tape should be relegated to the final archival stage only. Today, you can use multiple open (Amanda/Bacula) and closed backup software packages to write trailing-edge snapshots to tape libraries. I also know that the NDMP client services evolving for ZFS, though in their infancy, will someday allow easier integration into current backup systems, letting most people convert existing tape-based solutions entirely into their last-tier archive, run infrequently for long periods with just full backups.
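One way the trailing-edge-to-tape step could look, as a hedged sketch rather than any particular product's mechanism: the snapshot name, tape device, and age cutoff are assumptions, and the `zfs send`-to-tape pipeline is commented out since it needs a real pool and drive; the small age-policy helper is what runs.

```shell
# Hypothetical last-tier archive run: stream an aged snapshot to tape.
# Snapshot and tape device names are made up.
#
#   zfs send tank/archive@2008-01 | dd of=/dev/rmt/0n bs=1048576
#
# Pick only the "trailing edge": snapshots older than the cutoff get archived.
is_trailing() {  # usage: is_trailing <snapshot_age_days> <cutoff_days>
  if [ "$1" -ge "$2" ]; then echo archive; else echo keep; fi
}
is_trailing 45 30
is_trailing 7 30
```

Setting the cutoff past the point where most churn has settled is what gives Wayne's property of archiving mostly single, stable copies of each file.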

All the above is just my "it's getting better" perspective. Perhaps you can find some glaring weakness. I hope shortly you can all see the auto-cdp service that Nexenta has put together in action. It's well worth the wait.

4 comments: (Joe Little) said...

For those who care, auto-cdp is now out. Release notes give all the details, including new drivers added!

Wayne said...

My initial read-through of the CDP release notes did not answer a large uncertainty that I have about CDP solutions in general. That uncertainty is based on the notion of transactional integrity.
This thinking comes more or less from the database world, but I believe it also applies to block-level disk writes. All modern systems seem to have caching, often multiple layers of it. If the file system layer thinks it has committed a write, the underlying layers, all the way through to the end CDP target, also need either a commit or a roll-forward log to apply on recovery.

This may seem like a small, niggling matter, and perhaps we can just accept the small risk probabilities here. However, I think a primary use case for CDP is surviving a catastrophic hardware failure on the primary storage, and that could occur at any time during application-layer writes to the file system.
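Wayne's point about layered caches can be reduced to a tiny sketch: a write only counts as committed once a flush has pushed it through the caches to stable storage, which is the same point a CDP target must either have seen or be able to replay from a log. The log path here is illustrative, and `sync` stands in for whatever barrier (fsync, O_DSYNC, a commit record) the real application would use.

```shell
# Minimal sketch of a commit barrier across cache layers: the write counts
# as committed only after the flush returns. The log path is illustrative.
log=$(mktemp)
echo "txn-1 BEGIN payload COMMIT" >> "$log"
sync    # flush OS buffers; a CDP target must see this point, or else
        # replay it from a roll-forward log on recovery
tail -1 "$log"
```

If the failure lands between the `echo` and the `sync`, a consistent recovery requires that either the whole record or none of it reaches the CDP target, which is exactly the commit-or-replay requirement Wayne raises.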

Wayne said...

In general I think Joe has answered the questions I originally posed. Some of the answers are that solutions are still evolving.

One thing that I would like to comment on is not a direct response to anything that either Joe or I have said. Rather it is aimed at those people developing the next generation of parts and tools that get coupled together into systems such as NexentaStor.

The issue is one of manageability. Nexenta has done a good job of providing a single point of management for the parts that they provide. It's when we start layering on other bits, like file (or block-level) backup/restore software, NDMP feeds, and cross-system file mounts, that the overall picture of what constitutes your 'storage system' becomes fragmented. Fragmentation makes a system hard to manage and hard to understand, and it increases the probability of making mistakes.

To that end, it seems to me that whenever a choice can be had between doing something simple to accomplish a goal and chaining a bunch of parts together to accomplish the same goal with more sophistication, it's likely the simpler solution will be more sustainable over time.

But what if the simpler solution is just not good enough? This is where new innovation is needed. We need sophisticated functionality, for sure, but we also need higher productivity from our management tasks as well!
One without the other will relegate complex, albeit highly functional, systems to niche markets.

Wayne said...

Yet another comment - scalability:

There are two ways to overcome the limitations of what we have been thinking of as a 'headend'. By headend I mean the server that runs the appliance OS that presents the network file service.

One way is for that headend service to change from NFS/CIFS to a cluster file system such as Lustre, as Joe has mentioned.

Yet this requires each client of the storage to add software to present the new file system.

Another approach, already used in some commercial products, is to put the cluster capability in back of the head end. This has the advantage of being transparent to the client, so there is no software to install and maintain. I am not sure how this would work in the NFS world, but I do know that it works in the CIFS world through clever use of 'virtual IP' addresses and connection redirection.

As you can tell, I am in favor of the transparent-client solution, simply because I believe we should harness the power of our computers and software to make life simpler for us. Fewer parts on the client side scale far better as the number of clients ramps up.