Friday, November 02, 2007

Old is new again

Prior to my happiness in using all things ZFS, I was eagerly testing Lustre. My primary concerns there were both management of the backend storage and the ability to translate this network file system into standard NFS and CIFS for non-Linux clients. Sun recently announced that it would acquire Cluster File Systems, but prior to that, CFS had made announcements about how Lustre and ZFS would be married. I had always assumed that ZFS would be the exposed layer on a client, and that it would be the same old OSTs and OSDs on the backend. Not so.

CFS Moves Lustre to ZFS

Instead, ZFS is the backend to the storage nodes. Although the driver pool for OpenSolaris is somewhat smaller than for Linux, it does make one wonder exactly what new flexibility this arrangement affords. Also, how much of this will end up as open source code to be incorporated into a highly manageable product, such as my new favorite appliance, NexentaStor? I'll be diving further into this as more details emerge. Perhaps others can chime in with more information or clarification.

Thursday, November 01, 2007

The Coming Out Party for Commodity Storage

If you have been following along, I remarked in http://jmlittle.blogspot.com/2007/09/multi-tier-storage-revisited.html that "the increasing capabilities of Nexenta's storage solution and its underlying OpenSolaris base have proceeded apace, and I believe the future is very bright for this solution". It's one of the few bright spots that I've had the privilege of using to enable commodity-based storage solutions. I've been an early adopter of the NexentaStor multi-tier storage appliance, and I am happy to hear that not only is it approaching its first general release to customers, but a release candidate is being made available to the public. Although I run it directly on hardware, the VMware evaluation version of the product has been deemed fit enough for people to kick the tires and see exactly how it fits into an organization. Check out http://www.nexenta.com

Many will ask how this differs from hardware-based NAS and disk-to-disk solutions, and others will wonder how it compares to FreeBSD- and Linux-based solutions and projects already on the market. It comes down to what it does best now, and the potential of where it will go in the future. If you haven't been catching the storage news lately, NexentaStor is the first major product built on the ZFS filesystem, which brings to commodity storage much of what has until now only been accessible through the hardware vendors. It's that secret sauce that has justified those large margins and high-priced "vendor-provided and tested" disk drives. What if you could just build it out on your own? Many open source solutions supposedly allow for just that, but it's somewhat beyond a do-it-yourself level: the pieces aren't necessarily all integrated, the complete solution isn't truly comparable to commercial offerings, and they aren't production ready. The test is: would you feel safe having 50TB of your backups on that solution?

ZFS is all fine and good, but it's the integration I speak of that has made me settle on this particular product. It also brings a fully developed, commercial-grade NFSv4 server, fully managed snapshots with the necessary scheduling, multiple replication and tiering services to integrate it anywhere in my digital archive flow, virtualized and thin provisioning, and iSCSI target and initiator support for said storage. When installed on brawnier hardware architected to grow, it will quickly eclipse many heavily marketed primary storage solutions, at a true fraction of the cost.

Nexenta is building this on OpenSolaris and their own hybrid OpenSolaris/Debian-style distribution. It has just started to stretch its legs when it comes to potential. However, our use is in second-tier storage, and that truly is where it shines right now. We've already thrown 50TB of disk at this via SCSI, iSCSI, SATA, and the like. It enables reuse of the storage you have now for a credible tiering architecture, and both the web-based interface and the extensive command line interface allow legacy and new storage components to be managed. I could grow to a zettabyte of storage with unlimited snapshots on the current installation, but one would undoubtedly want a more thought-out long-term hardware architecture. At least the current design allows for phasing in new technology while phasing out the old within the same pools I use today. Long term, I have high hopes that the product further simplifies data growth and the management of a multitude of devices.

Now that this is finally available for public consumption, I'll be able to speak more and provide good best practice advice. Here is some ready advice to keep in mind:

1) As disk capacity grows while prices drop, the exposure window while rebuilding a lost disk widens, making it clear that RAID10 provides the best of all worlds for volume growth, redundancy, and recoverability.
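In ZFS terms, a RAID10-style layout is simply a pool of mirrored pairs. A minimal sketch, assuming hypothetical Solaris device names (these are administrative commands, not something to paste blindly):

```shell
# Create a pool of two-way mirrors (RAID10-style): usable capacity is
# half the disks, but losing a disk only resilvers its mirror partner,
# keeping the rebuild exposure window short.
zpool create tank \
  mirror c0t0d0 c0t1d0 \
  mirror c0t2d0 c0t3d0 \
  mirror c0t4d0 c0t5d0

# Volume growth is then just a matter of attaching another pair.
zpool add tank mirror c0t6d0 c0t7d0
```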

2) Don't throw away your primary storage. It's still a mature product, and NexentaStor is best suited to secondary storage at this time. Long term, you can migrate that primary storage into the second tier, managed by NexentaStor. Once you are familiar and comfortable with commodity-based storage solutions, you'll find them moving into primary storage environments when they're good and ready.

3) That all said, commodity-based storage solutions are now here. The wait is over; jump on in today.

Friday, October 12, 2007

Stanford's multi-tier solution: an Example



In my last post covering our year-long foray into multi-tier storage, I promised I would detail the specific configuration of systems used in production at Stanford. Prominently featured are a NetApp FAS3050 (28TB raw, standard dual-parity RAID4) used as the primary storage, as well as a Sun X4500 running NexentaStor (24TB raw, RAIDZ). We have a secondary NexentaStor head for location independence as well as further expansion with both SCSI- and iSCSI-attached storage, adding about another 26TB. What's missing from the picture are two important facts. First, other general-purpose file servers are also being tiered to the second head. Second, one of those file servers is a 4TB NAS unit based on the same NexentaStor product. The various capabilities of Nexenta's product allow it to perform well as primary storage, but in the commodity hardware realm you currently lack some features that the most discerning storage customers will still find in most integrated hardware solutions. Time alone will tell where the proper mix of hardware and Nexenta's solution ends up.

Friday, September 28, 2007

Multi-Tier Storage revisited

It's been a year since I posted here regarding Stanford's current and planned use of various storage solutions to virtually eliminate tape-based nightly archiving. Since then, the industry has gone through various changes, and in some cases, not much change at all.

Specifically, part of our solution used NeoPath's FileDirector for file-based virtualization and SBEI's iSCSI target solution for our backend storage. In the middle, pulling data from our primary NetApp file servers, was a burgeoning solution from Nexenta being beta tested at Stanford. So what has changed? NeoPath was acquired by Cisco, and the product currently in production has ceased to be supported. SBEI has been acquired by NeoNode, and their iSCSI target, best in class for enabling commodity storage, isn't getting much love. How has Nexenta fared? While we will likely need to migrate away from the other solutions, the increasing capabilities of Nexenta's storage solution and its underlying OpenSolaris base have proceeded apace, and I believe the future is very bright for this solution.

The NexentaStor product, in early beta, delivers today a snapshot-based, large-scale file system built on underlying storage pools (iSCSI, SCSI, SATA, FC, etc.) and a well-developed services architecture, including data synchronization and replication, multi-host data tiering, and other facilities, with data retention and disaster management to boot. Even its base system disks have bulletproof checkpointing, reversion, safe updating, and redundancy, all in a software solution. The future? Well, it's easy to see that with upcoming NFS v4.1 support, the product could tackle the namespace virtualization found in products such as the NeoPath. Already it can repurpose snapshot-based raw volumes as iSCSI targets, so if the underlying hardware is supported by OpenSolaris, you have an easily managed, enterprise-feature-level iSCSI target product.
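As a rough sketch of that last point, assuming the shareiscsi property present in OpenSolaris builds of this era (pool and volume names here are hypothetical):

```shell
# Carve a 100GB raw volume (zvol) out of an existing pool...
zfs create -V 100g tank/targets/vol0

# ...and expose it as an iSCSI target with a single property flip.
zfs set shareiscsi=on tank/targets/vol0

# The zvol still gets ZFS snapshots like any other dataset.
zfs snapshot tank/targets/vol0@nightly
```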

Stanford has over a year's worth of second-tier data, in both 60 daily and 12 monthly snapshots, tiered from our NetApp. These live within many separate folders, representing over a thousand snapshots per volume. We recently adopted the Sun X4500 24TB product and have migrated to this ideal solution for quicker disaster recovery. The read speeds on this 48-drive unit are great, and the price point rivals what we've built with iSCSI. Commodity storage is commodity storage, but we continue to utilize iSCSI, DAS (SATA-to-SCSI units), and other additional units to eclipse 48TB of secondary storage. We have also deployed this solution for one organization as both first- and second-tier storage, an additional 16TB when we count their installation, and it has proven its worth in day-to-day NAS use as well as in some data recovery and full disaster recovery scenarios.

Now that Nexenta supports some backup software as well as a client, we back up directly to tape from the second tier only once per year. We let our LTO-2 tape library run continuously for around a week just to give us a full archived edition of our data. Are we missing tape reuse, tape-based recovery, or juggling multiple library schedules just to meet an ever-growing nightly backup window? Nope. Nexenta looks to be here to stay.

I'll follow up later with specific details on configuration, where I hope things will go, and other random thoughts. On this anniversary, it would appear commodity-based multi-tier storage is practical and readily available.

Saturday, May 12, 2007

CommunityOne

I had the pleasure of presenting the case for Nexenta, an OpenSolaris distribution that combines the best of Solaris with the best of an Ubuntu/Debian distribution, at CommunityOne last week. My slides are now available online. The demos actually reference Martin Man's great Flash demos, which he posted at his site.

Wednesday, September 27, 2006

Multi-Tier Storage -- The Commodity Approach

I've been working on some internal documentation explaining our long-term plans regarding storage. Initially, I imagined two documents: an executive overview and a complete documentation set. Well, as things are hard to write up the first time and then maintain, I decided to make one document, moving all the technical and site-specific details to appendices. The end result is that I can now post the primary document, sans appendices, and make it public.

This is the culmination of a multi-year project to move to reliable commodity-based storage and get away from nightly tape backup scenarios that do not scale with today's storage growth. So go ahead and check out my multi-tier storage whitepaper.

Saturday, July 15, 2006

Converting LDAP netgroup entries back to flat file format

I am surprised that no one anywhere on the net has posted how to convert a netgroup objectclass back to flat file format. This is important for loading this dynamic data back into systems that themselves rely on static files. You'll need openldap-clients or a similar package (to get ldapsearch). Also, the script below assumes anonymous read access and, obviously, no SASL auth. Finally, the "grep net" step in building netgrouplist grabs only netgroup names containing "net", which is what we have standardized on.


#!/bin/bash
BASE="dc=example,dc=com"
HOST="ldap.example.com"

# Collect the netgroup names (our site convention puts "net" in every name).
netgrouplist=`ldapsearch -x -b "$BASE" -h "$HOST" objectclass=nisnetgroup cn | grep 'cn:' | grep net | awk '{print $2}'`

for i in $netgrouplist
do
  echo "$i \\"
  ldapsearch -x -b "$BASE" -h "$HOST" cn="$i" > /tmp/netgrp.$$
  # Pull just the triples, e.g. "(host,user,domain)" from
  # "nisNetgroupTriple: (host,user,domain)" lines.
  grep nisNetgroupTriple /tmp/netgrp.$$ | awk '{print $2}' > /tmp/netgrp-hosts.$$
  lastentry=`tail -1 /tmp/netgrp-hosts.$$`
  # Every continuation line except the last gets a trailing backslash.
  for j in `cat /tmp/netgrp-hosts.$$`
  do
    if [ "$j" = "$lastentry" ]
    then
      echo -e "\t $j"
    else
      echo -e "\t $j \\"
    fi
  done
  rm /tmp/netgrp-hosts.$$ /tmp/netgrp.$$
done

Wednesday, July 06, 2005

Configuring NeoPath for multi-tiered storage

The NeoPath Solution

We have embarked on a path whereby data is no longer incrementally backed up to tape, but rather backed up to a second tier of disk space. The reasoning is that tape technology is no longer able to keep up with disk technology and pricing. As our primary storage grows year over year, a long-term strategy for backing it up is required. Tape is still used, but it is now relegated to archival purposes only, and at long intervals at that.

There are various 1st-tier NAS solutions we use, from NetApp to commodity storage, with and without their own snapshot technologies. In most cases, snapshots provide fine-grained, file-level backups that cover hourly changes, or at least multiple images of the filesystem per day. However, each solution tends to have its own snapshot directories, and not all are relative to each directory in a volume. In some cases, snapshots are only available at the root of a volume.

Our 2nd-tier solutions are even more commodity, consisting of standard journaling filesystems on large volumes representing at least double the capacity of the matching first tier. We provide backups using hard-link-style snapshots via rsnapshot. These pull an initial copy from the first tier, either with rsync or via an NFS mount, and periodically (daily) copy deltas into new dated directories, preserving untouched files with hard links. 2nd-tier storage is coarser, representing daily snapshots over multiple months of incremental diffs. At regular intervals (6 months on average), a 2nd-tier snapshot is used as the source of a tape archival. Again, these snapshots tend to differ from the first tier in layout and directory structure. More importantly, the 2nd-tier systems are not directly accessible by end users.
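The hard-link trick rsnapshot relies on can be sketched with plain coreutils (the paths and dates below are throwaway examples): each new dated directory starts as a hard-link copy of the previous one, and only changed files are replaced with fresh copies.

```shell
#!/bin/sh
# Throwaway demo of hard-link snapshots, the mechanism behind rsnapshot.
SRC=/tmp/demo-src
SNAPS=/tmp/demo-snaps
rm -rf "$SRC" "$SNAPS"
mkdir -p "$SRC" "$SNAPS"
echo "v1" > "$SRC/file.txt"
echo "static" > "$SRC/keep.txt"

# Day 1: a full copy.
cp -a "$SRC" "$SNAPS/2005-07-05"

# A file changes upstream.
echo "v2" > "$SRC/file.txt"

# Day 2: hard-link the previous snapshot (near-zero space), then replace
# only the changed file. Removing it first matters: writing through the
# hard link would silently alter the day-1 snapshot as well.
cp -al "$SNAPS/2005-07-05" "$SNAPS/2005-07-06"
rm "$SNAPS/2005-07-06/file.txt"
cp "$SRC/file.txt" "$SNAPS/2005-07-06/file.txt"
```

After this runs, the unchanged keep.txt exists once on disk but appears in both dated directories, while each day sees its own version of file.txt.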

Finally, the multi-tier solution spreads storage across multiple distinct servers. How do end users know where to get their data, and how can they perform self-serve restorations? The solution we found is NeoPath. This product acts as an NFS or CIFS aggregator, allowing new logical paths to consolidate storage into a single logical tree if necessary. It also provides live data migration between servers, so it protects one's continuing investment in 1st- and 2nd-tier storage, allowing for the acquisition of new 1st-tier storage and the migration of older storage to the backend or out of service entirely. The migration capability can be critical when failing hardware needs to be replaced and it's impossible to disentangle a system from centralized storage services. Other features include defining virtual servers, synthetic directory trees, and synthetic links and unions formed from backend mounted file servers.

Minor Quibbles

For all of its advantages, the NeoPath product still has its faults. Primarily, design decisions made to do the right thing can at first get in the way of a basic implementation. The first problem we noticed is that NeoPath, on its face, requires each backend share to be read-write, since it needs to store metadata about that share on each backend file server (within a hidden directory). In the case of snapshot filesystems exported directly, you get a read-only system by design of the NAS. Second, even in the case of our 2nd-tier hard-link file snapshots, we do not wish to expose that filesystem to the outside as read-write. When you re-export multiple tiers into a single tree, clients are generally given read-write permissions to their primary directories, and it's only possible to enforce read-only permissions on parts of that tree if the NeoPath itself mounts them read-only.

The other problem is that although clients can mount at any permissible point in a share, the NeoPath product itself only permits inclusion of backend directories into its virtual trees via the explicitly defined exports of the backend file servers. You cannot directly reference a deeper directory when forming unions or synthetic links. It also honors only the first mount point it encounters when traversing a backend file server, so you cannot work around the issue by defining multiple levels of exported mount points.

All Problems Have Solutions

We have successfully resolved these issues with a little guidance from NeoPath. Over time, I hope to refine and further explain these solutions.

First, whereas the web-based GUI does not let you explicitly define certain options for accessing backend servers, the command line environment does. You will need to consult the command line reference guide, but the gist is that to allow for read-only volumes, one can define alternative dstorage locations (NeoPath speak). What we did was define a small, 64MB read-write share from a NAS. That was the smallest volume our NAS offered, so you can likely go smaller. This volume serves as metadata storage for all file servers once you instruct the NeoPath product to use it. Now we can pull in any type of share regardless of write permissions.

The second problem has a similar but more involved solution. One can get past the issue of deep linking on backend servers by creating another minimal volume where symbolic links will be created. The idea is that when creating a synthetic directory, one should mount all backend file servers in a uniform path. The primary paths in the synthetic directory that users will see should be created in the small volume, using relative symbolic links into the uniform paths, including any necessary deep references. An example is useful here. Take this directory structure:

/myorg/users : (/myorg/users/jlittle -> /myorg/tier/1/vol7/users/jlittle)
/myorg/backup/tier1 : (/myorg/backup/tier1/users/jlittle -> /myorg/tier/1/vol7/.snap)
/myorg/backup/tier2 : (/myorg/backup/tier2/users/jlittle -> /myorg/tier/2/vol7)
/myorg/tier/1 : (contains vol1 through vol7 mounts of the tier 1 system)
/myorg/tier/2 : (contains vol1 through vol7 mounts via a union of tier 2 systems)


In the above example, the users and backup trees are synthetic links to two small volumes defined on a NAS for the purpose of generating symbolic link trees. The last two lines are direct synthetic links to backend 1st- and 2nd-tier storage. The various backup points are actually links to the head of each snapshot volume, as the user first needs to traverse into a date-labeled directory before proceeding into /users/username or the equivalent. To generate the symbolic link tree, I took output from file listings that show actual relative paths per user (e.g. ../vol7/users/jlittle) and built a little script to be run from a system mounting the base tree of /myorg.

#!/bin/bash
# Rebuild the symbolic link trees from a list of per-user relative
# paths (e.g. ../vol7/users/jlittle), run from a host mounting /myorg.
MNTDIR=/mnt
SRCFILE=/root/users-lists

for LINE in `cat $SRCFILE`
do
  VOL=`echo "$LINE" | awk -F'/' '{ print $2 }'`
  USER=`echo "$LINE" | awk -F'/' '{ print $4 }'`
  echo "$VOL $USER"
  # Link the user's home into the uniform tier-1 path...
  cd "$MNTDIR/users" || exit 1
  ln -sf "../tier/1/$VOL/users/$USER" "$USER"
  # ...then the tier-1 snapshot head...
  cd ../backup/tier1/users/ || exit 1
  ln -sf "../../../tier/1/$VOL/.snap" "$USER"
  # ...and finally the tier-2 snapshot volume.
  cd ../../tier2/users/ || exit 1
  ln -sf "../../../tier/2/$VOL" "$USER"
done


The only issue left is providing easy maintenance of this link list as users move around. It's an exercise left to administrators to tie this into their account creation and migration scripts and processes.

Thursday, June 30, 2005

What I may say...

This is my first blog entry, and as such, it will serve as an introduction to what may show up here. I've played with various blog clients (iBlog is great!), wikis, CMSes, etc. In the end, I want consistency in what I use, and my use is pretty erratic. I mostly need a place to build documentation for the various complete, semi-complete, or planning-stage projects that I'm always running in parallel. I find my attempts at documentation wanting. Therefore, a consistent blog approach may in fact aid these efforts.
