<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-14093712</id><updated>2012-01-04T20:51:36.812-08:00</updated><category term='nexenta shell developer root'/><category term='active directory'/><category term='domains'/><category term='zfs'/><category term='multi-tier storage'/><category term='SmartOS'/><category term='cache'/><category term='tiering'/><category term='opendirectory'/><category term='Beckwith'/><category term='gorm'/><category term='upgrade'/><category term='zfs manageability cdp pNFS lustre'/><category term='macosx server'/><category term='KVM'/><category term='grails'/><category term='lustre'/><category term='free nas'/><category term='nexentastor'/><category term='ldap'/><category term='disk controller'/><category term='services'/><category term='developer'/><category term='nfs'/><category term='disaster recovery'/><category term='Western Digital'/><category term='ddrdrive'/><category term='amanda'/><category term='nexenta'/><category term='double time'/><category term='java'/><category term='webobjects'/><category term='cifs'/><category term='adaptec'/><category term='10.5'/><category term='models'/><category term='migration'/><category term='zil_disable'/><category term='nas'/><category term='x1'/><category term='SATA'/><category term='SDC'/><category term='tape'/><category term='sid'/><category term='block device'/><category term='iram'/><category term='iscsi'/><category term='log'/><category term='samba'/><category term='slog'/><category term='Joyent'/><category term='opensolaris'/><category term='resilver'/><category term='WD'/><category term='sas'/><category term='10.4'/><title type='text'>Little 
Notes</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>31</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-14093712.post-7522800156958915815</id><published>2012-01-04T20:42:00.000-08:00</published><updated>2012-01-04T20:51:36.837-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Joyent'/><category scheme='http://www.blogger.com/atom/ns#' term='KVM'/><category scheme='http://www.blogger.com/atom/ns#' term='SmartOS'/><category scheme='http://www.blogger.com/atom/ns#' term='SDC'/><title type='text'>Getting hands dirty with Joyent SDC: first lesson learned</title><content type='html'>Finally getting into Joyent's private cloud technology. I'll talk more about what all of this is useful for some other time, but this post is more of a note to self / note of warning. I repurposed some beefy ESX nodes for testing out Smart Data Center. But, those didn't have disks worth anything. Instead, I took some disks that were evacuated out of ZFS pools for larger drives. 
They would still be fine here...&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The problem arises when setting up compute nodes, and later when re-installing the head node, should that become necessary. Compute node configuration would quietly fail without any errors, and re-installs of the head node dug a deeper hole. It turns out that Joyent is being overly cautious in creating data pools for the head and compute nodes, and won't attempt to create the necessary local disk pools if the disks were previously associated with active ZFS pools. Silent errors are never good.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The workaround is to bring up the head node in its recovery mode, which is documented as not importing any pools. Next, associate the drives, import the old pools (if all of their disks are present) or create a new pool on each individual disk, and then "zpool destroy" them. Rinse, repeat. I finally got my head node installed in a sane way, and now it's on to some remaining problems with compute nodes and testing out KVM and vcpu support. 
More on that later.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-7522800156958915815?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/7522800156958915815/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=7522800156958915815' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/7522800156958915815'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/7522800156958915815'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2012/01/getting-hands-dirty-with-joyent-sdc.html' title='Getting hands dirty with Joyent SDC: first lesson learned'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-6689549959878570218</id><published>2011-05-30T08:13:00.000-07:00</published><updated>2011-05-30T10:42:51.147-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='webobjects'/><category scheme='http://www.blogger.com/atom/ns#' term='services'/><category scheme='http://www.blogger.com/atom/ns#' term='gorm'/><category scheme='http://www.blogger.com/atom/ns#' term='domains'/><category scheme='http://www.blogger.com/atom/ns#' term='models'/><category scheme='http://www.blogger.com/atom/ns#' term='grails'/><category scheme='http://www.blogger.com/atom/ns#' term='Beckwith'/><title type='text'>Web 
Frameworks / Models just aren't the same</title><content type='html'>Most of my days are spent hacking on web applications, with a strong requirement for database-backed solutions. I've been drinking the WebObjects Kool-Aid for quite some time, as there hasn't been a robust ORM (Object-Relational Mapping) solution that matches the maturity and the it-just-works quality of WebObjects' EOF layer. &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However, the proverbial writing has been on the wall when it comes to Apple's continued caretaking of the public version of this technology. A lot of technologies have arisen that make the continuing effort of using WebObjects questionable, and my mind simply can't quite get around the rule engine solution of Modern Direct2Web that helps modernize WebObjects to match. It's always a question of finding the best tool for the job, and part of the toolset is oneself. Am I sharp or honed enough to meet the new challenges I face? I've been re-investing myself in WebObjects daily, while also checking out other frameworks. In almost all cases, I find again and again that they still don't match the now antiquated WO in getting things done right.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Then there is Grails. First, it's not Rails, which leaves a sour taste in my mouth. But it seems to take enough from both the Java/WO and Rails worlds, some of the best and some of the worst (Servlets, bleh!). I'm also stuck dealing with both Hibernate's deficiencies and the Grails band-aids on top of them. 
Burt Beckwith has provided multiple articles on the brain-dead handling of collections and especially many-to-many relationships, which requires fetching all entities to guarantee uniqueness in add and delete operations (it's more an issue with belongsTo and hasMany; see the &lt;a href="http://burtbeckwith.com/blog/?p=191"&gt;original example&lt;/a&gt; and these &lt;a href="https://mrpaulwoods.wordpress.com/2011/02/07/implementing-burt-beckwiths-gorm-performance-no-collections/"&gt;indirect implementation details&lt;/a&gt;). Obviously, the object graph shows some immaturity. Grails 1.4 and the underlying updates, though, finally get me past my fears and concerns, and so a few projects are now being built on Grails, since I just can't get the quick build-out of applications above the model layer that I need in Modern D2W, and I require the dynamism of Groovy for certain specific requirements. Again, it's more about finding the tool that suits me best than about the limitations of the tools.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This brings us to the meat of my posting today. EOF and Wonder's templates have spoiled me in what model code (including the generation gap pattern) is provided for me and what I expect at the model layer. I've been trying to come to terms with both the features, and the lack thereof, of model classes in Grails apps. Rereading the great book &lt;a href="http://www.manning.com/gsmith/"&gt;Grails In Action&lt;/a&gt;, I came to an important realization about what is missing here. Section 5.2 gets into the best practice of using Grails Services to encapsulate business logic and follow DRY principles. But, if one considers the MVC frameworks and where model logic goes, there seems to be a lot of multi-domain logic (relationships) that never ends up in Grails domains and that is best handled in Services. 
In the end, I've come to believe that WebObjects EOF models map most directly not to Grails domains, but to Grails services. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;With all the time in the world, I think I'd want to spend time on a plugin or template enhancements to auto-generate more complete service definitions from "grails create-service": one that takes a domain and wraps it for basic operations, but also builds out basic relationship management methods in the service. This would also be an ideal place to be collections-aware and turn some of Beckwith's ideas into best practice. If collections were always handled in the same manner in code, it would make the complicated implementation of the correct, performant way much more trivial. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Furthermore, akin to the generation gap pattern, domains would be tinkered with less, other than to define what can and should go into the database directly. This is important for managing database migrations. Instead, any and all custom logic should live in the service. Perhaps one day Domains will get all the correct relationship handling logic that EOF superclasses generally get, and the Service will then be more akin to the custom-logic-only aspect that I've come to expect of EOF subclasses for my model objects. However, I feel my mind can work with this construct to be productive quickly in Grails instead of fighting against the grain or dirtying my controllers with model-specific mess. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For now though, I will endeavor to always use Services extensively, and make sure any generated scaffolding takes them into account more than Domains. 
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-6689549959878570218?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/6689549959878570218/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=6689549959878570218' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/6689549959878570218'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/6689549959878570218'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2011/05/web-frameworks-models-just-arent-same.html' title='Web Frameworks / Models just aren&apos;t the same'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-2197530077960075489</id><published>2010-05-06T08:55:00.000-07:00</published><updated>2010-05-06T09:02:52.758-07:00</updated><title type='text'>NexentaStor issues?</title><content type='html'>Someone pointed out &lt;a href="http://www.how-to-hide-a-corpse-on-federal-land.com/words/index.php?title=nexentastor_is_a_no_go&amp;amp;more=1&amp;amp;c=1&amp;amp;tb=1&amp;amp;pb=1"&gt;this "review"&lt;/a&gt; to me and asked if it was true. I ran into a similar issue. The user in that article was using the free 12TB edition without support, so perhaps that was why they didn't ask around per se or file a bug. 
&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So, why is copying from one ZFS volume to another over rsync seemingly going on forever? I can't be sure this is the issue, but I had the same result, though in my case it was going from a NetApp to a ZFS data store using NexentaStor 3.0. The problem was that the source .snapshot tree was exposed, and likely in the case of the above reviewer, their .zfs tree was exposed. I've already mentioned to the Nexenta people that it's safer to exclude the terms ".snapshot | .zfs" by default in rsync service definitions, and let the end user override that. I too first thought it was the dedup going awry, but what experimentation showed the problem to be was rsync discovering those hidden paths and syncing each one. Dedup will only find duplicate blocks that line up, but the overall exposure to all those snapshots will come at some price.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If you are pulling data from one snapshot-based file system to another, it is always best to do so relative to the most recent snapshot, as you are assured data isn't changing during the synchronization, and you'll avoid falling down the snapshot well.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-2197530077960075489?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/2197530077960075489/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=2197530077960075489' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/2197530077960075489'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/2197530077960075489'/><link 
rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2010/05/nexentastor-issues.html' title='NexentaStor issues?'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-2553158442093953538</id><published>2010-03-17T10:50:00.000-07:00</published><updated>2010-03-19T15:12:54.707-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='zfs'/><category scheme='http://www.blogger.com/atom/ns#' term='WD'/><category scheme='http://www.blogger.com/atom/ns#' term='sas'/><category scheme='http://www.blogger.com/atom/ns#' term='SATA'/><category scheme='http://www.blogger.com/atom/ns#' term='Western Digital'/><title type='text'>WD Caviar Green drives and ZFS (UPDATED)</title><content type='html'>We are in the process of outfitting a new primary storage system, and I was of a mind to buy more WD Caviar Green drives, specifically more of the 1.5TB WDEADS drives, as we already had 4 new ones that were tested behind a slower RAID card. Before buying more, I searched the usual suspects for pricing, and found the 1TB to 2TB versions of this drive are all priced very well, even for 5400RPM drives, but various sites and comments now note that they should not be used in RAID configurations. Hmm.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I did a little more research and saw &lt;a href="http://breden.org.uk/2009/05/01/home-fileserver-a-year-in-zfs/"&gt;this blog post&lt;/a&gt; describing how one should avoid directly integrating these drives with ZFS. I got a couple, so I decided to put them in my server with an LSI-3442E SAS backplane and test them. 
First, I tested my 500GB drives in a mirror set, and doing a "ptime dd if=/dev/zero of=test1G bs=4k count=250000" on the ZFS volume made up of those drives, I transferred 1GB in 3.63 seconds, or 282MB/sec. I then immediately tried the same on my mirror set of the WD drives, benefitting from caching of the first write. After 50+ minutes of waiting, I killed the write and saw that I had transferred only 426MB, at a rate of 136KB/sec. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Yes, I can confirm that these drives are less than useless in a ZFS system (&lt;b&gt;see update below&lt;/b&gt;), even as a simple two-disk mirror set. Some basic iostat output showed way too much "asvc_t" service time on the disks, running from 3.5 secs to 10 secs per write, whereas the service times for the working 500GB drives were 0.7msec or the like. I had various mpt_handle_event_sync errors in my kernel logs, so perhaps there is some specific pathology between the SAS HBA, the SAS/SATA backplane, and these disks. However, we've proven this box works well with various drives. I'm going to try yet another 1.5TB drive, likely the previously maligned Seagate drives, since I've yet to have trouble with the latest firmware on those. My 4 WD drives will be placed in enclosures for external Time Machine backups in the near future. WD Caviar Green != Enterprise RAID drives. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;UPDATE&lt;/b&gt;:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I'm leaving the above as is, but I think I may have discovered a bad drive in the set, as when I employed 4 drives of this type I saw odd I/O patterns but OK performance in a straight RAID 0. However, I regularly have at least one drive with higher average service times, and trailing I/O writes as it catches up to the other drives. With these 4 drives in a pool (RAID 0), I got 193MB/sec writes and 242MB/sec reads. 
Sticking them into a RAID10 (2 data, 2 mirror), I got a mere 78MB/sec writes and 278MB/sec reads. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Splitting them off into two separate RAID1 data pools, I ran my tests and still saw high service times on the drives (only 65 or so, much better than the above, but still slow). Per-mirror-set performance was dismal: I regularly got 150MB/sec+ from a mirror of Caviar Black drives, but these drives just hit 31-34MB/sec (i.e., roughly half of the above RAID10). I guess with enough drives I'll get to better numbers in RAID10. In a RAIDZ1 (RAID5) grouping, it was 60MB/sec on the writes, and 172MB/sec on the reads. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So what accounts for the dismal performance I originally saw? I think it has to do with multiple pools being active when they are not all of this drive type. My original test had a Hitachi drive set as well as a WD Caviar Green drive set. Although my tests ran one at a time, I'm guessing there were some bad timing/driver issues and/or hardware issues when dealing with the mixed HD media. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A second, updated conclusion is that you can use these drives in an array, but only if the array contains nothing but this drive type. RAID10 will get you sufficient performance, but otherwise you'll want to leave this to secondary storage. 
Future drive replacement scenarios are a real cause for concern.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-2553158442093953538?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/2553158442093953538/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=2553158442093953538' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/2553158442093953538'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/2553158442093953538'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2010/03/wd-caviar-green-drives-and-zfs.html' title='WD Caviar Green drives and ZFS (UPDATED)'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-7928911368378668371</id><published>2010-03-02T10:54:00.000-08:00</published><updated>2010-03-03T07:29:43.863-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='nexenta'/><category scheme='http://www.blogger.com/atom/ns#' term='slog'/><category scheme='http://www.blogger.com/atom/ns#' term='zil_disable'/><category scheme='http://www.blogger.com/atom/ns#' term='log'/><category scheme='http://www.blogger.com/atom/ns#' term='ddrdrive'/><category scheme='http://www.blogger.com/atom/ns#' term='zfs'/><category scheme='http://www.blogger.com/atom/ns#' term='nfs'/><category 
scheme='http://www.blogger.com/atom/ns#' term='x1'/><category scheme='http://www.blogger.com/atom/ns#' term='nexentastor'/><title type='text'>ZFS Log Devices: A Review of the DDRdrive X1</title><content type='html'>My previous notes here have covered the trend toward commodity storage, my happiness with most things ZFS and Nexenta, and how someday this will all make for a great primary storage story. At Stanford, we have a lot of disk-to-disk backup storage based on &lt;a href="http://www.nexenta.com/"&gt;Nexenta&lt;/a&gt; solutions, using iSCSI or direct attached storage. We have also had some primary tier uses, but have had to play fast and loose with ZFS to get comparable performance. In essence, we sacrificed some of the ensured data integrity of ZFS to meet end users' expectations of what file servers provide. &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A typical step was to set these values in /etc/system:&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;set zfs:zil_disable = 1&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;set zfs:zfs_nocacheflush = 1&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;These flags allowed a ZFS appliance to perform similarly to Linux or other systems when it came to NFS server performance. When you are writing a lot of large files, the ZFS Intent Log's additional latency doesn't affect NFS client performance. However, when these same clients expect their fsyncs to be honored on the back end with mixed file sizes that trend to a large volume of small writes, we start to see pathologically &lt;a href="http://weblog.etherized.com/posts/130"&gt;poor performance&lt;/a&gt; with the ZIL enabled. 
I measured the performance at 400KB/sec in some of my basic synthetic tests. With the ZIL disabled, I generally got 3-5MB/sec or so, or roughly 10x the performance. That's &lt;a href="http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide"&gt;cheating&lt;/a&gt; and &lt;a href="http://blogs.sun.com/erickustarz/entry/zil_disable"&gt;not so safe&lt;/a&gt; if the client thinks a write is complete but the backend server doesn't commit it before power loss or crash.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;One ray of hope previously mentioned &lt;a href="http://jmlittle.blogspot.com/2008/01/using-iram-improving-zfs-perceived.html"&gt;on this site&lt;/a&gt; was the &lt;a href="http://www.gigabyte.com.tw/Products/Storage/Products_Overview.aspx?ProductID=2180"&gt;Gigabyte i-RAM&lt;/a&gt;. This battery-backed SATA-I solution held some promise, but at the time I used it I found a few difficulties. The state of the art at that time did not allow removal of log (ZIL-dedicated) devices from pools, so one had to recreate a pool if the log device failed. That raised some problems with the i-RAM. First, I had it go offline twice, requiring a reset of the device, which essentially blanked it out and required re-initializing it as a drive with ZFS. Second, the connection was SATA-I only, and it did not play well with certain SATA-II chipsets or when &lt;a href="http://jmlittle.blogspot.com/2008/05/mixing-sata-dos-and-donts.html"&gt;mixed with SATA-II devices&lt;/a&gt;. Many users had to enable it in IDE mode versus the preferred AHCI mode. &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;Time has passed, and new solutions present themselves. 
First, log devices can be added to or removed from a pool at any time, on the fly. Also new to the discussion is the &lt;a href="http://www.ddrdrive.com/"&gt;DDRdrive X1&lt;/a&gt; product. This mixed RAM and NAND device provides a 4GB drive image with extremely high IOPS and a way to save to stable store (NAND SLC flash) if power is lost on the PCI bus. The device itself is connected to a PCI-Express bus, with drivers for OpenSolaris/Nexenta (among others) that make it visible as a SCSI device. &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;I tried different scenarios with this ZIL device, and in all of them it proved to be a sweet little device. I had mixed files that I pushed onto the appliance via NFS (Linux client) and found that I could multiply the number of clients and linearly increase performance. Where I would hit 450KB/sec without the ZIL device and not improve that rate by much with additional writers of data, using the ZIL log device immediately resulted in a good 7MB/sec of performance, with 4 concurrent write jobs yielding 27MB/sec. During this test, my X1 showed only a 20% busy rate using iostat. With the device only 20% busy at 27MB/sec, it would appear that I should get up to 135MB/sec (5x the concurrent writers), but my network connection was just gig-e, so getting anywhere near 120+MB/sec would be phenomenal. Another sample of mixed files with 5 concurrent writers pushed the non-X1 config to 1.5MB/sec, but in this case, the X1 took my performance numbers to 45-50MB/sec.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;So what is providing all this performance? 
As I mentioned above, the fsyncs on writes from the NFS client enforce synchronous transactions in ZFS when the ZIL is not disabled. Without an X1 log device, I measured around 120 IOPS (I/O operations per second). With the dedicated RAM/NAND DDRdrive X1 solution, I easily approach 5000 IOPS. Those commits happen quickly, with the final stable store to your disk array laid out in your more typical 128K blocks per I/O. This dedicated ZIL device has been shown to do up to 200000 IOPS in synthetic benchmarks. Let's try the NFS case one more time, in a somewhat more practical test.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;Commonly, in simulation, CAD applications, software development, or the like, you will be conversing with the file server while committing hundreds to thousands of small file writes. To test this out and make it the worst-case scenario of disk block-sized files, I created a directory of 1000 512-byte files on the client's local disk. I did multiple runs to make sure this fit in memory so that we were measuring file server write performance. I then ran 400 concurrent jobs writing this to the file server into separate target directories. First, with the dedicated ZIL device enabled, I got 24MB/sec write rates averaging 6000 IOPS. I did spike up to 43K IOPS and 35MB/sec, likely when committing some of the metadata associated with all these files and directories. Still, the X1 was only averaging 20% busy during this test.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;Next, I disabled the DDRdrive X1 and tried again, hitting the same old wall. This was the pathological case. 
With 400 concurrent writes I still just got 120 IOPS and 450KB/sec. My only thought at the time was "sad, very sad". &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;You can draw your own conclusions from this mostly not-too-scientific test. For me, I now know of an affordable device that has none of the drawbacks (4K block size, wear leveling) of SSD drives for use as a ZIL device. One can now put together a commodity storage solution with this and Nexenta, and get, without compromise, the performance one would expect from any first tier storage platform. &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;That leads me to the "one more thing" category. I decided to place some ESX NFS storage-pooled volumes on this box, and compare it to the performance of the NetApps we use to manage our ESX VMs (NFS). The file access modes of the VMs tend to be similar to mixed size file operations, but they do tend to be larger writes, so the ZIL may not have as drastic an effect. Anyway, I tried it without the X1 and I got 30-40MB/sec measured disk performance from operations within the VM (random tests, dd, etc). Enabling the ZIL device, I got 90-120MB/sec rates, so we still got a 3x improvement. 
I couldn't easily isolate all traffic away from my NetApps, but I averaged 65MB/sec on those tests.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;Here, I think the conclusion I can draw is this: The dedicated ZIL device again improved performance up to matching what I theoretically can get from my network path. The comparison one can safely make with a NetApp is not that it's faster, as my test ran under different loads, but that it likely can match the line rates of your hardware and remove from the equation any concern for filesystem and disk array performance. Perhaps in a 10G network environment or with some link aggregation we can start to stress the &lt;a href="http://www.ddrdrive.com"&gt;DDRdrive X1&lt;/a&gt;, but for now it's obvious that it enables commodity storage solutions to meet typical NAS performance expectations. 
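&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;As a rough sanity check on the small-file NFS figures quoted earlier (24MB/sec at 6000 IOPS with the X1, 450KB/sec at 120 IOPS without), here is a minimal Python sketch that divides throughput by IOPS to estimate the average payload per operation. It assumes KB and MB mean powers of 1024, which the post does not state, and the function name is purely illustrative.&lt;/span&gt;&lt;/div&gt;&lt;pre&gt;
```python
# Figures quoted in the post for the 400-job small-file NFS test.
# Assumption: KB and MB are treated as powers of 1024.
KIB = 1024
MIB = 1024 * KIB


def bytes_per_op(throughput_bytes_per_sec, iops):
    """Average payload carried by each I/O operation."""
    return throughput_bytes_per_sec / iops


with_slog = bytes_per_op(24 * MIB, 6000)     # dedicated log device enabled
without_slog = bytes_per_op(450 * KIB, 120)  # log device disabled
iops_ratio = 6000 / 120

print(f"with X1:    {with_slog:.0f} bytes per op")
print(f"without X1: {without_slog:.0f} bytes per op")
print(f"IOPS ratio: {iops_ratio:.0f}x")
```
&lt;/pre&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;Under those assumptions both runs work out to roughly 4KB per operation, so the gap in throughput simply tracks the roughly 50x gap in IOPS. 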
&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:georgia;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-7928911368378668371?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/7928911368378668371/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=7928911368378668371' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/7928911368378668371'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/7928911368378668371'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2010/03/zfs-log-devices-review-of-ddrdrive-x1.html' title='ZFS Log Devices: A Review of the DDRdrive X1'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-5657567966637487767</id><published>2009-11-27T20:54:00.000-08:00</published><updated>2009-11-27T21:09:50.580-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='zfs'/><category scheme='http://www.blogger.com/atom/ns#' term='resilver'/><category scheme='http://www.blogger.com/atom/ns#' term='nexentastor'/><category scheme='http://www.blogger.com/atom/ns#' term='double time'/><title type='text'>ZFS 
Resilver quirks and how it lies</title><content type='html'>One of my ZFS-based storage appliances was running low on disk space, and since I made it a three-way stripe of mirrored disks, I could take the six 500GB drives and replace each of them in place with a 1.5TB drive, resulting in a major increase in capacity. Nifty ZFS software RAID feature versus typical hardware RAID setups. It's all good in theory, but resilvering (rebuilding an array pair) after replacing a drive takes quite some time. Even with only about 400GB to rebuild per drive, one sees the resilvering process cover 90% of the rebuild in 12 hours or so, but that last 10% takes another 10-12 hours. I think this has a lot to do with how snapshots or small files hurt ZFS performance, especially when you are close to a full disk. But it's all just a guess as to why it's slow on the tail end.&lt;br /&gt;&lt;br /&gt;The resilver went as planned, replacing one drive after another serially, but taking care to only do one drive of a pair at a time. Near the end, I started to get greedy. With 98% done on one resilver, I detached a drive in another mirrored pair on the same volume, planning on at least placing the new drive into the chassis so I could start the final drive resilver remotely. To my surprise, the resilver restarted from scratch, so I had another 24 hours of delay to go. So, any ZFS drive removals will reset in-progress scrubs/resilvers!&lt;br /&gt;&lt;br /&gt;I then decided just to go ahead with the second resilver. This is where it got really strange. The two mirrored pairs started to resilver, and the speed was seemingly faster. After 12 hours, both pairs had about 400GB resilvered and the status of the volume indicated it was 100% done and was finishing. Hours later, it was still at 100%, but the resilver counter per drive kept climbing. 
Finally, after the more typical 24 hours or so, it noted it was completed.&lt;br /&gt;&lt;span class="Apple-style-span"   style="font-family:monospace, serif;font-size:100%;"&gt;&lt;span class="Apple-style-span" style="font-size: 13px; white-space: pre;"&gt;&lt;span class="Apple-style-span"   style="font-family:Georgia, serif;font-size:130%;"&gt;&lt;span class="Apple-style-span" style="font-size: 16px; white-space: normal;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;pre&gt;  pool: data&lt;br /&gt;state: ONLINE&lt;br /&gt;scrub: resilver completed after 26h39m with 0 errors on Tue Nov 24 22:33:46 2009&lt;br /&gt;config:&lt;br /&gt;&lt;br /&gt; NAME        STATE     READ WRITE CKSUM&lt;br /&gt; data        ONLINE       0     0     0&lt;br /&gt;   mirror    ONLINE       0     0     0&lt;br /&gt;     c2t1d0  ONLINE       0     0     0&lt;br /&gt;     c2t0d0  ONLINE       0     0     0&lt;br /&gt;   mirror    ONLINE       0     0     0&lt;br /&gt;     c2t3d0  ONLINE       0     0     0&lt;br /&gt;     c2t2d0  ONLINE       0     0     0  783G resilvered&lt;br /&gt;   mirror    ONLINE       0     0     0&lt;br /&gt;     c2t5d0  ONLINE       0     0     0&lt;br /&gt;     c2t4d0  ONLINE       0     0     0  781G resilvered&lt;/pre&gt;&lt;br /&gt;Yes, it looks like at least with this B104+ kernel in NexentaStor, the resilver counters lie. When you have two ongoing resilvers, each counter is nominally the total data resilvered across the whole pool. You'll thus need to wait for double the expected data amount before it completes. Thus, its very important to not reset the system until 100% turns into a "resilver completed..." 
statement in the status report.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-5657567966637487767?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/5657567966637487767/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=5657567966637487767' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/5657567966637487767'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/5657567966637487767'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2009/11/zfs-resilver-quirks-and-how-it-lies.html' title='ZFS Resilver quirks and how it lies'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-8391417179249947378</id><published>2009-08-18T10:47:00.001-07:00</published><updated>2009-08-18T11:00:40.588-07:00</updated><title type='text'>Prepping for Snow Leopard Server and a lesson on backups</title><content type='html'>We all know that MacOSX 10.6 Server is coming out RSN. All of us who use OpenDirectory are starting to wonder about the pain that will soon endure when upgrading. Here's a few hints to keep in mind.&lt;br /&gt;&lt;br /&gt;- Time Machine Backups do not by default restore a good MacOSX Server image. 
Read all about it &lt;a href="http://www.bill.eccles.net/bills_words/2008/08/designed-to-fail-apple-time-ma.html"&gt;here&lt;/a&gt; and learn now what will go wrong. Namely, edit the mentioned StdExclusions.plist file to remove /var/log and /var/spool from the exclusion list, and consider recreating your backups from scratch&lt;br /&gt;&lt;br /&gt;- If you have ADC membership or otherwise can purchase WWDC 09 videos, acquire Session 622, Moving to Snow Leopard Server. Lots of good stuff there, but I'll suggest a less than perfect but simpler upgrade path&lt;br /&gt;&lt;br /&gt;- To upgrade, use Carbon Copy Cloner or the like to make a full bootable system copy on an external drive -- likely your Time Machine disk. At this point, you can also re-enable Time Machine to use the rest of the disk for backups using the corrected excludes list. Obviously, this disk should be far larger in size than what you have used on your OSX Server.&lt;br /&gt;&lt;br /&gt;- You might be upgrading to a beefier 64-bit Intel configuration for your OpenDirectory master or just upgrading in place on the old hardware. I recommend doing this on new hardware. Take that clone disk and boot off of it on the new box, and then clone yet again to the local disk or array. Now you can do an in-place upgrade to 10.6 on non-production hardware, test, etc. Your previous master is now your first replica when you go production. If you upgrade in place, you should first test that the clone disk boots and works as your primary, but now you do have a full production-worthy backup disk.&lt;br /&gt;&lt;br /&gt;- Once you are past a certain point in time, I'd remove the backupdbs on that external disk (don't erase it) and reuse it for Time Machine again. You now have a way to revert to 10.5 pre-upgrade or revert to any 10.6 point in time. 
You should check the exclusions file before commencing Time Machine backups to make sure you are getting the expected full server backup.&lt;br /&gt;&lt;br /&gt;- Profit&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-8391417179249947378?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/8391417179249947378/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=8391417179249947378' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/8391417179249947378'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/8391417179249947378'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2009/08/prepping-for-snow-leopard-server-and.html' title='Prepping for Snow Leopard Server and a lesson on backups'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-8240110669900409638</id><published>2008-08-02T22:40:00.001-07:00</published><updated>2008-08-02T23:10:16.653-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='amanda'/><category scheme='http://www.blogger.com/atom/ns#' term='zfs'/><category scheme='http://www.blogger.com/atom/ns#' term='nexentastor'/><title type='text'>Amanda: simple ZFS backup or S3</title><content type='html'>When I first started researching ZFS, I found it somewhat troubling that no native backup solution existed. 
Of course there were the ZFS send/recv commands, but those didn't necessarily work well with existing backup technologies. At the same time, the venerable open source backup solution, &lt;a href="http://www.amanda.org/"&gt;amanda&lt;/a&gt;, had found a way to move beyond its limitation of maximum tape size restricting backup run size. Over time, we have found ways to marry these two solutions.&lt;br /&gt;&lt;br /&gt;In my multi-tier use of ZFS for backup, I always need an n-tier component that will allow for permanent archiving to tape every 6 months or year, as deemed fit for the data being backed up. These are full backups only, and due to the large amounts of data in the second tier pool, a backup to tape may span dozens of tapes and run multiple days. I found I had to tweak amanda's typical configuration to allow for very long estimate times, as the correct approach to backing up a ZFS filesystem today involves tar. Amanda's approach does a full tar estimate of a backup before a real backup is attempted. Otherwise, a sufficiently large tape library and a working amanda client configuration on your ZFS-enabled system are all you need.&lt;br /&gt;&lt;br /&gt;For those following along, I'm an avid user of &lt;a href="http://www.nexenta.com/"&gt;NexentaStor&lt;/a&gt; for my second tier storage solution. Setup of an amanda client on that software appliance is actually quite easy.&lt;br /&gt;&lt;code&gt;&lt;br /&gt;setup network service amanda-client edit-settings&lt;br /&gt;setup network service amanda-client conf-check&lt;br /&gt;setup network service amanda-client enable&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;That's all that one needs to do. There is a sample line in the amanda configuration that you adjust in the first command above. 
The line I used is similar to this:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;amandasrv.stanford.edu amanda amdump&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;You'll find that, depending on your build of amanda server, you'll have either the legacy user name of "amanda", the zmanda default of "amanda_backup", or the Redhat default of "backup" as the user things run as. I guess there had to be a user naming conflict at some point with "amanda".&lt;br /&gt;&lt;br /&gt;The hardest part of the configuration is finding where you have your long term snapshots. Since a backup run can take days to weeks, you'll likely wish to back up volumes relative to a monthly snapshot. In your amanda &lt;b&gt;/etc/amanda/CONFIDR/disklist&lt;/b&gt; configuration, a sample you may have for a ZFS-based client named &lt;b&gt;nexenta-nas&lt;/b&gt; with volume &lt;b&gt;tier2/dir*&lt;/b&gt; is:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;nexenta-nas /volumes/tier2/dir1/.zfs/snapshot/snap-monthly-1-latest  user-tar-span&lt;br /&gt;nexenta-nas /volumes/tier2/dir2/.zfs/snapshot/snap-monthly-1-latest  user-tar-span&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Note well the use of user-tar-span in the two lines above. This allows for backing up large volumes over multiple tapes in amanda. That one limitation of tape spanning in amanda was solved in a novel way. They break up backup streams into "chunksizes" of a set size, so that a write failure at the end of one tape can begin fresh again at the beginning of that chunk on the following tape. This feature also allows amanda to be used to back up to Amazon's S3 service. Yes, instead of going to tape, you can configure a tape server to write to an S3 service. S3 limits writes to a maximum of 2GB a file, and amanda's virtual tape solution combined with that chunk sizing of backups works wonderfully to mate ZFS-based storage solutions to S3 for an n-tier solution. 
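&lt;br /&gt;&lt;br /&gt;To make the chunking idea concrete, here is a small Python sketch of the arithmetic only. It is not amanda code: the function names, the 500 GiB backup size, and the 301 GiB failure point are hypothetical, and the 2 GiB chunk size is simply the per-file limit mentioned above.&lt;br /&gt;&lt;pre&gt;
```python
GIB = 1024 ** 3


def chunks_needed(backup_bytes, chunk_bytes):
    """Number of fixed-size chunks needed to hold one backup stream."""
    return (backup_bytes + chunk_bytes - 1) // chunk_bytes


def chunk_to_retry(bytes_written_ok, chunk_bytes):
    """Zero-based index of the chunk to restart from after a write failure.

    Only whole chunks count as safely written, so the retry begins at the
    start of the chunk that was interrupted.
    """
    return bytes_written_ok // chunk_bytes


# Hypothetical figures: a 500 GiB full backup cut into 2 GiB chunks.
backup = 500 * GIB
chunk = 2 * GIB
total_chunks = chunks_needed(backup, chunk)

# Suppose the tape fills after 301 GiB: the chunk with index 150 is
# incomplete and is rewritten from its start on the next tape.
retry_index = chunk_to_retry(301 * GIB, chunk)
print(total_chunks, retry_index)
```
&lt;/pre&gt;The point of the sketch is that a failure costs at most one chunk of rewritten data, which is why the same scheme maps onto fixed-size objects in a remote store.&lt;br /&gt;&lt;br /&gt;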
Please consult &lt;a href="http://wiki.zmanda.com/index.php/How_To:Backup_to_Amazon_S3"&gt;Zmanda's howto&lt;/a&gt; for configuring your server correctly. There really is nothing left to configure  to get ZFS data to S3.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-8240110669900409638?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/8240110669900409638/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=8240110669900409638' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/8240110669900409638'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/8240110669900409638'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2008/08/amanda-simple-zfs-backup-or-s3.html' title='Amanda: simple ZFS backup or S3'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-404226705269614615</id><published>2008-07-27T10:06:00.001-07:00</published><updated>2008-07-27T10:06:43.391-07:00</updated><title type='text'>Pogo Linux, Nexenta announce StorageDirector Z-Series storage</title><content type='html'>Pogo Linux Inc., a Seattle-based storage server manufacturer, and Nexenta Systems Inc., developer of NexentaStor, an open storage solution based upon the revolutionary file system ZFS, announced Wednesday immediate availability of a new set of storage appliances 
featuring NexentaStor. Yeah... that was the posted text above. What does it really mean? More kit choices to get an open storage NAS. Some nice configuration options when ordering, but I didn't see an easy way to request smaller system disks versus the rest of the data drives for any given Z series unit. It's a very good first step. If a Linux vendor adopts an appliance based on OpenSolaris (albeit a Debian/Ubuntu-lookalike), you know there is something cooking.&lt;br/&gt;&lt;br/&gt;&lt;a href='http://www.wwpi.com/index.php?option=com_content&amp;amp;task=view&amp;amp;id=4840&amp;amp;Itemid=128'&gt;read more&lt;/a&gt; | &lt;a href='http://digg.com/hardware/Pogo_Linux_Nexenta_announce_StorageDirector_Z_Series_storag'&gt;digg story&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-404226705269614615?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/404226705269614615/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=404226705269614615' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/404226705269614615'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/404226705269614615'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2008/07/pogo-linux-nexenta-announce.html' title='Pogo Linux, Nexenta announce StorageDirector Z-Series storage'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-8238936644951527604</id><published>2008-06-16T15:13:00.000-07:00</published><updated>2008-06-16T15:23:39.994-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='webobjects'/><title type='text'>Closet WO Developer</title><content type='html'>One of many hats I wear is that of a erstwhile java developer. Our internal apps have been heavily reliant on object relational mappers. I've dabbled into RoR, and even helped get a TurboGears project off the ground here that was fully open source. However, the primary solution we've used in production since 2002 has been the grand daddy of ORM solutions: WebObjects. &lt;br /&gt;&lt;br /&gt;This past week at Apple's WWDC was a great one for WebObjects. The usual NDA applies. However, prior to that the WebObjects community had their own two day in-depth conference in San Francisco. No NDA for that, and I can report that WO development is alive, well, and dare I say thriving? The news about &lt;a href="http://sproutcore.com"&gt;SproutCore&lt;/a&gt; has a second story, in that the backend of choice may be RoR, but the #1 apps will likely also be WO-based. Got an iPhone? Learn WO. As we opensource some of our projects here, I'll write a few more posts and speak on some more points, but with the latest release of WebObjects (5.4.x) the final deployment restrictions on the free WO frameworks were lifted. 
I expect some level of renewed interest.&lt;br /&gt;&lt;br /&gt;To find out more, check in with the &lt;a href="http://wocommunity.org"&gt;WOCommunity&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-8238936644951527604?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/8238936644951527604/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=8238936644951527604' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/8238936644951527604'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/8238936644951527604'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2008/06/closet-wo-developer.html' title='Closet WO Developer'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-551120290576504063</id><published>2008-06-04T09:44:00.000-07:00</published><updated>2008-06-04T10:18:02.619-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='nexenta'/><category scheme='http://www.blogger.com/atom/ns#' term='disk controller'/><category scheme='http://www.blogger.com/atom/ns#' term='zfs'/><category scheme='http://www.blogger.com/atom/ns#' term='sas'/><category scheme='http://www.blogger.com/atom/ns#' term='opensolaris'/><category scheme='http://www.blogger.com/atom/ns#' term='SATA'/><category scheme='http://www.blogger.com/atom/ns#' 
term='nexentastor'/><title type='text'>Recommended Disk Controllers for ZFS</title><content type='html'>Since I've been using OpenSolaris and ZFS (via NexentaStor, plug plug) extensively, I get a lot of emails asking about what hardware works best. There have been various postings on the opensolaris and zfs lists to the same effect. A lot of people reference the OpenSolaris HCL lists, which leave the average user scratching their head with more questions than answers. More to the point, the HCL doesn't tend to answer the more direct question of what hardware I should get to build a ZFS box, NAS, etc. It's important to note that in the case of ZFS, all that extra checksum, fault management, and performance goodness can be negated by selecting a "supported" hardware RAID card. Worse yet, many RAID cards are not fully interchangeable on the spot.  What do you want for ZFS?&lt;br /&gt;&lt;br /&gt;First, pick any 64-bit dual core or better motherboard or processor. If you can get ICH6+, nvidia, or Si3124-based onboard SATA, then you are in good shape for your basic ZFS box with onboard SATA for your system disks alone. System disks can be low-end 5400RPM 2.5 inch SATA-I drives. Many people then desire some large memory, battery-backed RAID card, but my tests with the high end LSI SAS cards show that memory on the RAID card doesn't do you as much good as having a recipe of lots of system RAM, a sufficient number of cores, many disk drives for the spindles, and sufficient use of the PCI-X/PCIe bus using JBOD-only disk controllers. I'll cover the controllers next, but I'd recommend at this point 4GB of RAM minimum, dual core at greater than 2GHz, and for any good load, at least two PCI-X or multi-lane PCIe cards.&lt;br /&gt;&lt;br /&gt;Disk controllers are where the real questions are asked. Over multiple iterations, heavy use, and some anecdotal evidence, we are down to some sweet spots. 
For PCI-X, there is one game in town, the Marvell-based AOC-SATA2-MV8, used in the X4500. At $100 for 8 JBOD SATA-II ports, it just works and is fault managed. Stick just SATA-II disks on these, and keep any SATA-I disks on the motherboard SATA ports for system disks. I'll add that various Si3124 based cards exist here, but not with sufficient port density.&lt;br /&gt;&lt;br /&gt;&lt;a href=http://www.newegg.com/Product/Product.aspx?Item=N82E16815121009&gt;SuperMicro AOC-SATA2-MV8 link&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;When it comes to PCIe, there aren't any good high port count options for SATA. If you need just 2 ports, or eSATA, there are various solutions based on the Si3124 chipset, and SIIG makes many of them for $50 each. However, in the PCIe world, the real answer is SAS HBAs that connect to internal or external mixed SAS/SATA disk chassis. Again, most SAS HBAs are either full fledged RAID without JBOD support, or simply don't work in the OpenSolaris ecosystem. 3ware is a lost cause here. The true winner for both cost and performance, while providing the JBOD you want, is the LSI SAS3442E-R.&lt;br /&gt;&lt;br /&gt;&lt;a href=http://www.cdw.com/shop/products/default.aspx?EDC=1385491&gt;CDW catalog link for LSI 3442ER&lt;/a&gt;&lt;br /&gt;&lt;a href=http://www.lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/lsisas3442er/index.html&gt;LSI 3442ER product page&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;It's $250, but I've seen it as low as $130. 8 channels, with both 2 internal ports (generally 8 drives are connected to a single SAS port) as well as the external port. You can use this with an external SAS-backed array of SATA drives from Promise, for instance, to easily populate 16 or 32 drives internally, with an additional 48 drives externally, just from the one card. Would I suggest that many on that single card? No, but you can. 
Loading up your system with 2 or 4 of these cards, which are based on the LSI 1068 chipset that is well supported by Sun is the best way forward for scale out performance. I was given some numbers of 200MB/sec writes and 400MB/sec reads on an example 12-drive system using RAIDZ. Good numbers, as I got 600MB/sec reads on a 48-drive X4500 thumper.&lt;br /&gt;&lt;br /&gt;If you have PCI-X, go Marvell. PCIe? Go LSI, but stick to the JBOD-capable not-so-RAID HBAs. Don't just trust me, throw a $100 or two at these and try it yourself. You'll see a better investment than $800 at the larger RAID cards. I went the latter route and have paid dearly (Adaptec, LSI, you name it). What worked from the beginning and is working today are the Marvell cards here, and I've been playing with new systems that use the LSI 3442ER.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-551120290576504063?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/551120290576504063/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=551120290576504063' title='15 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/551120290576504063'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/551120290576504063'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2008/06/recommended-disk-controllers-for-zfs.html' title='Recommended Disk Controllers for ZFS'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>15</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-4913724434166364894</id><published>2008-05-31T23:22:00.000-07:00</published><updated>2008-05-31T23:32:17.827-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='zfs'/><category scheme='http://www.blogger.com/atom/ns#' term='opensolaris'/><category scheme='http://www.blogger.com/atom/ns#' term='SATA'/><title type='text'>Mixing SATA dos and donts</title><content type='html'>Another day, another bug seemingly hit. I've known for some time that mixing SATA-I and SATA-II devices on the same controller with regards to OpenSolaris seems to be unwise. I've already had systems with the initial ZFS-boot drive being a small capacity and thus likely SATA-I, but the data volumes were SATA-II. My recent issues with the iRAM could be related to having a SATA-I device after a SATA-II drive in the chain, but nothing has been concrete.&lt;br /&gt;&lt;br /&gt;However, today I discovered something else. One array I have is made up of all SATA-I drives and was used by a SATA-I RAID card that went south. I happily replaced it with the Marvell SATA-II JBOD card, and it was working just fine. I then lost the 6th of 7 drives, and went back to the manufacturer to try and buy a replacement. Sadly, these Raid Edition drives have been "updated" to be at a minimum of SATA-II for the same model. Replacing the failed SATA-I with the SATA-II worked, but on subsequent reboots, the 7th drive tended to not be enumerated by the Marvell card at startup, and even after re-inserting it, a "cfgadm" was necessary to activate it. Even then, a "zpool import" or "format" to introspect the now configured drive would wedge and never complete the command. Weird, right?&lt;br /&gt;&lt;br /&gt;The solution to return to stability was to swap the 6th and 7th drive, so that the SATA-II disk came after all the SATA-I devices in the chain. 
I'm not sure why it works, but every reboot works now: it never fails to enumerate that last drive, and there is no need to manually cfgadm configure the drive post boot. Therefore, a set of truisms is starting to come together for mixed SATA drives. Whether Marvell, Sil3124, or the like, it's never a good idea to mix SATA-I and SATA-II devices on a single controller, but if necessary, make sure that the SATA-II drives come after the SATA-I drives. The best configuration is to restrict SATA-I boot devices, such as small 5400RPM "laptop" drives, to their own onboard SATA interface, and leave all SATA-II devices to add-on boards.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-4913724434166364894?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/4913724434166364894/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=4913724434166364894' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/4913724434166364894'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/4913724434166364894'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2008/05/mixing-sata-dos-and-donts.html' title='Mixing SATA dos and donts'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-7357360724483270491</id><published>2008-05-27T12:34:00.000-07:00</published><updated>2008-05-27T13:03:40.174-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='iram'/><category scheme='http://www.blogger.com/atom/ns#' term='log'/><category scheme='http://www.blogger.com/atom/ns#' term='zfs'/><title type='text'>The problem with slogs (How I lost everything!)...</title><content type='html'>A while back, I spoke of the virtues of using a slog device with ZFS. The system I went into production with had an Nvidia-based SATA controller onboard and a Gigabyte i-RAM card. No problems there, but at the time it used the cmdk driver (PATA mode) for my OpenSolaris-based NexentaStor NAS. After a while, I got an error where the i-RAM "reset" and the log went degraded. The system simply started to use the data disks for the intent log. So, no harm done. It's important to note that the kernel was a B70 OpenSolaris build.&lt;br /&gt;&lt;br /&gt;Later, I wanted to upgrade to NexentaStor 1.0, which had B85. Post upgrade or using a boot CD, it would never come up with the i-RAM attached. The newer kernel used the nv_sata driver, and I could always get it to work in B70, so I reverted to that. This is one nice feature that Nexenta has had for quite some time, in that the whole OS is checkpointed using ZFS to allow reversions if an upgrade doesn't take. Well, the NAS doesn't like having a degraded volume, so I've been trying to "fix" the log device. Currently, in ZFS, log devices cannot be removed, but only replaced. So, I tried to replace it using the "zpool" command. Replacing the failed log with itself always fails as it's "currently in use by another zpool". I figured out a way around that, and that was to fully clear the log using something like "dd if=/dev/zero of=/dev/mylogdevice bs=64k". 
I was able to upgrade my system to B85, and then I attempted to replace the drive again, and it looked like it was working:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;  pool: data&lt;br /&gt; state: DEGRADED&lt;br /&gt;status: One or more devices is currently being resilvered.  The pool will&lt;br /&gt; continue to function, possibly in a degraded state.&lt;br /&gt;action: Wait for the resilver to complete.&lt;br /&gt; scrub: resilver in progress for 0h0m, 0.00% done, 450151253h54m to go&lt;br /&gt;config:&lt;br /&gt;&lt;br /&gt; NAME         STATE     READ WRITE CKSUM&lt;br /&gt; data         DEGRADED     0     0     0&lt;br /&gt;   raidz1     ONLINE       0     0     0&lt;br /&gt;     c7t0d0   ONLINE       0     0     0&lt;br /&gt;     c7t1d0   ONLINE       0     0     0&lt;br /&gt;     c6t1d0   ONLINE       0     0     0&lt;br /&gt; logs         DEGRADED     0     0     0&lt;br /&gt;   replacing  DEGRADED     0     0     0&lt;br /&gt;     c5d0     UNAVAIL      0     0     0  cannot open&lt;br /&gt;     c8t1d0   ONLINE       0     0     0&lt;br /&gt;&lt;br /&gt;errors: No known data errors&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Note well that it is replacing one log device with another (using the new nv_sata naming). However, after it reached 1% it would always restart the resilver, even with no ZFS activity, no snapshots, etc. The system was busy resilvering and resetting, getting nowhere. I decided to reboot to B70, and as soon as that came up, it started to resilver immediately and, after quite a long time for a 2GB drive, it completed the resilver. So, everything was now fine, right?&lt;br /&gt;&lt;br /&gt;This is where things really went wrong. At the end of the resilver, it still considered the volume degraded, and looked like the above output but with only one log device. After I rebooted the system, the volume started spewing out ZFS errors, and checksum counters went flying. My pool went offline. 
Another reboot, this time with the log device disconnected because nv_sata would not boot with it connected, caused immediate kernel panics. What the hell was going on? Using the boot CD, I tried to import the volume. It told me that the volume had insufficient devices. A log device shouldn't be necessary for operation, as the pool hadn't needed it before. I attached the log device and ran &lt;b&gt;cfgadm&lt;/b&gt; to configure it, which works and gets around the boot-time nv_sata/i-RAM issue. Now it told me that I had sufficient devices, but what happened next was worse. The output showed that my volume consisted of one RAIDZ, an empty log device definition, and additionally my i-RAM as an additional degraded drive added to the array as a stripe! No ZFS command was run here. It was simply the state of the system relative to what the previous resilver had accomplished.&lt;br /&gt;&lt;br /&gt;Any attempt to import the volume failed with a ZFS error regarding its inability to "iterate all the filesystems" or something to that effect. I was able to mount various ZFS volumes read-only by using "zfs mount -o ro data/proj" or similar. I then brought up my network and had to manually transfer the files off to recover, but this pool is now dead to the world.&lt;br /&gt;&lt;br /&gt;What lessons have I learned? &lt;b&gt;Slog devices in ZFS, though a great feature, should not be used in production until they can be evacuated&lt;/b&gt;. There may be errors in the actions I took above, but the bugs that I see include the inability of the nv_sata driver to deal with the i-RAM device for some odd reason, at least in B82 and B85 (as far as I've tested). The other bug is that a log replace appears to either not resilver at all (B85) or, when resilvering in older releases, causes the system to not correctly resilver the log but instead to shim the slog in as a data stripe. 
I simply can't see how that is by any stretch of the imagination by design.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-7357360724483270491?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/7357360724483270491/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=7357360724483270491' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/7357360724483270491'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/7357360724483270491'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2008/05/problem-with-slogs-how-i-lost.html' title='The problem with slogs (How I lost everything!)...'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-8274152979195584748</id><published>2008-05-03T22:40:00.000-07:00</published><updated>2008-05-03T23:07:33.490-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='zil_disable'/><category scheme='http://www.blogger.com/atom/ns#' term='adaptec'/><category scheme='http://www.blogger.com/atom/ns#' term='zfs'/><category scheme='http://www.blogger.com/atom/ns#' term='cache'/><title type='text'>ZFS: Is the ZIL always safe?</title><content type='html'>One of my ZFS-based appliances, used for long term backup, was upgraded from B70 to B85 of OpenSolaris two weeks ago. 
This time around, I re-installed the system to get RAIDZ2, and certain "hacks" that I've been using were no longer in place. The old settings were in /etc/system, and were the well-known &lt;b&gt;zil_disable&lt;/b&gt; and &lt;b&gt;zfs_nocacheflush&lt;/b&gt; options, both enabled. They were left there from when the system temporarily acted as a primary server for a short time with its Adaptec (aac) SATA RAID card and its accompanying SATA-I drives. Since the unit was UPS attached, it was relatively safe for NFS client access, and later on there was no direct client access over NFS. No harm done, and stable for quite some time over multiple upgrades from B36 or so, over a year without an error.&lt;br /&gt;&lt;br /&gt;A curious thing happened as soon as I upgraded without these somewhat unsafe settings for the kernel. I started to get tons of errors, and twice my pool has gone completely offline until I cleared and scrubbed it. An example of the errors:&lt;br /&gt;&lt;code&gt;&lt;br /&gt; NAME        STATE     READ WRITE CKSUM&lt;br /&gt; tier2       DEGRADED     0     0     0&lt;br /&gt;   raidz2    DEGRADED     0     0     0&lt;br /&gt;     c1t1d0  FAULTED      0    64     0  too many errors&lt;br /&gt;     c1t2d0  DEGRADED     0    46     0  too many errors&lt;br /&gt;     c1t3d0  DEGRADED     0    32     0  too many errors&lt;br /&gt;     c1t4d0  DEGRADED     0    47     0  too many errors&lt;br /&gt;     c1t5d0  DEGRADED     0    39     0  too many errors&lt;br /&gt;     c1t6d0  FAULTED      0   118     0  too many errors&lt;br /&gt;     c1t7d0  DEGRADED     0    57     0  too many errors&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Nothing explained the turnaround from stable to useless for any writes. I also got some read errors, and no nightly rsync against this tree would survive without incrementing some error count. Was it somehow one of my cache settings on the Adaptec card that conflicted with a new version of the "aac" driver? 
I thought I would need to isolate it, revert perhaps, or consider that somehow my card was simply dying. Perhaps the cache/RAM on the card itself was toast.&lt;br /&gt;&lt;br /&gt;A recent post on the opensolaris-discuss mailing list gave me an idea. Mike DeMarco suggested that a user suffering from repeated crashes that corrupt ZFS until cleared try zil_disable to test &lt;b&gt;"if zfs write cache of many small files on large FS is causing the problems."&lt;/b&gt; That makes some sense if the card is somehow trashing on small writes. Using the pool for backup means that it's being read and written to via rsync, which can involve many small updates. I also had various read errors pop up. So, after another degraded pool, I put the old faithful zil_disable and, for good measure, zfs_nocacheflush back, and after a reboot and scrub, let it do its nightly multi-terabyte delta rsyncs. After a few days, &lt;b&gt;there are no errors&lt;/b&gt;. Have I stumbled onto some code path bug that is ameliorated by these kernel options? 
Do newer kernels have suspect aac drivers?&lt;br /&gt;&lt;br /&gt;Perhaps someone will prove the logic of the above all wrong, but for now, I'm returning to the old standby "unsafe" kernel options to keep my pool stable.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-8274152979195584748?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/8274152979195584748/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=8274152979195584748' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/8274152979195584748'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/8274152979195584748'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2008/05/zfs-is-zil-always-safe.html' title='ZFS: Is the ZIL always safe?'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-7297445071636576663</id><published>2008-04-03T14:54:00.000-07:00</published><updated>2008-04-03T15:22:48.407-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='migration'/><category scheme='http://www.blogger.com/atom/ns#' term='opendirectory'/><category scheme='http://www.blogger.com/atom/ns#' term='samba'/><category scheme='http://www.blogger.com/atom/ns#' term='upgrade'/><category scheme='http://www.blogger.com/atom/ns#' term='ldap'/><category 
scheme='http://www.blogger.com/atom/ns#' term='macosx server'/><category scheme='http://www.blogger.com/atom/ns#' term='10.5'/><category scheme='http://www.blogger.com/atom/ns#' term='sid'/><category scheme='http://www.blogger.com/atom/ns#' term='10.4'/><title type='text'>OpenDirectory upgrade path from 10.4 to 10.5</title><content type='html'>&lt;div&gt;In EE we've migrated over from various AD and OpenLDAP installations to what we hope is a more manageable solution long term. Sadly, upgrading OpenDirectory (MacOSX OpenLDAP-based directory services) from 10.4 to 10.5 doesn't work as Apple states it would. Here's the complete recipe we used to keep our data, our passwords, and most importantly, our domain SID. Apple tends to not care about maintaining the SID in various replica-to-master promotion steps.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;First, a reference to the cookbook &lt;a href="http://www.netmojo.ca/blog/2007/11/13/tiger-to-leopard-server-migration-part-two/"&gt; doing things the hard way&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;As noted in the above and in other postings, upgrades do not work. Rather, what needs to be done is this:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;10.4 Server&lt;/span&gt;:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;1&lt;/span&gt;) go to Server Admin, OpenDirectory, and under the Archive tab, generate an archive of the OpenDirectory DB. Place in admin home directory&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;2&lt;/span&gt;) For safe keeping, go to /var/db/samba and get the secrets.tdb file. Place in admin home directory (readable by all)&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;3&lt;/span&gt;) get the current SID by running as root/sudo "net getdomainsid EE" where EE is the domain we are supporting. 
Place in home directory&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;4&lt;/span&gt;) copy off to a 3rd party machine the above three files/directories&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;10.5 Server&lt;/span&gt;:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;1&lt;/span&gt;) Install fresh, and use the exact same IP and name as the 10.4 Server. You'll likely need to have these on their own net. Also note that without a link on the primary interface, smb, dns, and opendirectory don't work. I suggest connecting to the third party machine listed above, in my case my laptop's physical connection which I assign to the private net&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;2&lt;/span&gt;) You'll need DNS set up temporarily, so create a DNS server for your domain (stanford.edu) and create a host entry for yourself. Point local network settings to self as DNS server&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;3&lt;/span&gt;) copy over the files saved from 10.4 from the laptop/3rd party machine&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;4&lt;/span&gt;) Make an OpenDirectory Master, using the correct domain "dc=ee,dc=stanford,dc=edu" and correct KRB realm "EE.STANFORD.EDU"&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;5&lt;/span&gt;) import the archive of 10.4&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;6&lt;/span&gt;) run as root "mkpassdb -kerberize"&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;7&lt;/span&gt;) Create a new PDC config for Windows. Use the directoryadmin account/password to give samba correct access to the OpenDirectory DB&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;8&lt;/span&gt;) edit /var/db/smb.conf to fit the /etc/smb.conf entries you had on 10.4. 
Likely you'll want to make "local path = " and add "admin users = directoryadmin, domainjoin, @admin" or the like, where the first is the directory admin acct, the second is a PDC join account that can't login, but has directory admin rights. @admin works to include anyone in admin group&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;9&lt;/span&gt;) run as root "chflags uchg /var/db/smb.conf" to freeze your samba config. Recommend making a copy as well in the same dir.&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;10&lt;/span&gt;) run as root "net setdomainsid (SID)" where SID is the one you saved from 10.4&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;11&lt;/span&gt;) Go into Workgroup Manager. Change preferences to enable Inspector. Go into Inspector and select "Config" and then "CIFSServer". The two Value lines with "xml version.." need to have Edit run against them, and replace the SID line in each with the SID you just used.&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;12&lt;/span&gt;) restart Samba/Windows services. Check SID with, as root, "net getdomainsid" and "net getlocalsid EE" or the like. If anything didn't stick, do 10, 11 again.&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;13&lt;/span&gt;) before going live, one needs to remove reference to the local DNS in Network preferences, and optionally disable DNS service. This setup also was only tested with Wins service enabled as the WINS Server&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;14&lt;/span&gt;) test, test, test from Windows including domain logins, enumeration of groups in windows for adding domain users, etc. Logs may show if accounts are failing.&lt;br /&gt;&lt;br /&gt;On Windows, the simple tests you can do involve the utility "nltest" which is in the free SUPPORT TOOLS (but may not be installed by default).  nltest /?  
gives commands although OS-X samba only supports some of them.&lt;br /&gt;&lt;br /&gt;..to list PDC and BDCs ---  nltest /dclist:your_domain&lt;br /&gt;&lt;br /&gt;nltest /dclist:ee&lt;br /&gt; Domain 'ee' is pre Windows 2000 domain.  (Using NetServerEnum).&lt;br /&gt;  List of DCs in Domain ee&lt;br /&gt;   \\EE-OD (PDC)&lt;br /&gt; The command completed successfully&lt;br /&gt;&lt;br /&gt;..to verify schannel ---  nltest /sc_query:your_domain&lt;br /&gt; C:\&gt;nltest /sc_query:ee&lt;br /&gt;  Flags: 0&lt;br /&gt;  Trusted DC Name \\EE-OD&lt;br /&gt;  Trusted DC Connection Status Status = 0 0x0 NERR_Success&lt;br /&gt;  The command completed successfully&lt;br /&gt;&lt;br /&gt;To do a more detailed check, you can open the Windows Manager and try to look at the members of the Administrator group for the machine. When we had trouble, it just showed raw SID numbers, even for EE\DomAdmins. Once it was fixed, those showed correctly.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Error cheat sheet&lt;/span&gt;:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;1&lt;/span&gt;. If smb logs show that directoryadmin or domainjoin and the like have the "wrong sid" in passdb, you'll need to demote/promote Windows Servers to workgroup and back to PDC. You'll need to run "chflags nouchg /var/db/smb.conf" first and copy back your copied version after repromotion, as the file will be rewritten. Repeat steps 9-12 above&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;2&lt;/span&gt;. 
If kerberos isn't effectively working on clients, you may need to reimport the archive OpenDirectory, rerun "mkpassdb -kerberize" and follow the above demote/promote steps.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-7297445071636576663?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/7297445071636576663/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=7297445071636576663' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/7297445071636576663'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/7297445071636576663'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2008/04/opendirectory-upgrade-path-from-104-to.html' title='OpenDirectory upgrade path from 10.4 to 10.5'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-7528950546224624295</id><published>2008-04-03T14:44:00.000-07:00</published><updated>2008-04-03T16:53:37.429-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='nexenta shell developer root'/><title type='text'>Have NAS, Want Shell</title><content type='html'>Now that anyone can grab Nexenta's NAS product, many will undoubtedly want to get under the hood, especially developers. 
First, a fair warning that although the management infrastructure is resilient to many changes done manually, modifying various service configurations outside of Nexenta's internal version control can lead to one or two headaches if you aren't careful. That said, give me a shell!&lt;br /&gt;&lt;br /&gt;Well, that's simple. When you log in via the console (ssh, for example), simply run "setup appliance nmc edit-settings". You can tab your way through that command as well. Once there, go and edit expert_mode to be "1". Yes, you've entered the "vi" command zone, so save and exit with ':wq'&lt;br /&gt;&lt;br /&gt;Once the changes are saved, you'll be asked to refresh the console settings, and now you can type "!bash" to get a nice usable shell, or bang escape any command. You'll be root, so be aware and behave! Now you know what Nexenta Core was all about, as it's all there at your fingertips, along with the NMS, NMC, and NMV subsystems that are the foundation of the NAS product.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;update&lt;/span&gt;:&lt;br /&gt;I was told that an alternative way to set expert mode is &lt;span style="font-weight:bold;"&gt;&lt;blockquote&gt;option expert_mode = 1 -s&lt;/blockquote&gt;&lt;/span&gt; as denoted in the "option -h" documentation for NMC. 
The "-s" flag updates the on-disk configuration.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-7528950546224624295?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/7528950546224624295/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=7528950546224624295' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/7528950546224624295'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/7528950546224624295'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2008/04/have-nas-want-shell.html' title='Have NAS, Want Shell'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-260036965832390606</id><published>2008-04-03T08:30:00.000-07:00</published><updated>2008-04-03T15:24:39.262-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='nexenta'/><category scheme='http://www.blogger.com/atom/ns#' term='free nas'/><category scheme='http://www.blogger.com/atom/ns#' term='developer'/><title type='text'>Developers, developers, developers...</title><content type='html'>Ever wanted that NAS on your own hardware, for free? 
Nexenta has finally released their &lt;a href="http://www.nexenta.com/corp/index.php?option=com_content&amp;task=view&amp;id=18&amp;Itemid=75"&gt;NexentaStor Developer Edition 1.0&lt;/a&gt;, which is a free version of their commercial product with only a 1TB limit on used storage. All other functionality is there, unlimited. This is a near-final release for the commercial version, but is the first version the general public can get and install on their own hardware.&lt;br /&gt;&lt;br /&gt;The release represents many things, but the Developer releases are focused on more than just tire kicking or a free NAS product for your home NAS needs. Rather, there is a lot of potential to extend and use Nexenta's SA-API for storage service-enabled solutions. Wish to modify your DB to wrap a transaction in a snapshot? Need to automate separate file system creation, quotas, etc. for your users? Registered users on the web site can look at an overview of the architecture and sample SA-API components. I expect much more in the way of API details in the near future. However, the release of the product is here today.&lt;br /&gt;&lt;br /&gt;A general support forum is &lt;a href="http://www.nexenta.com/forum"&gt;also available&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;There are two other automation aspects to NexentaStor that I haven't given much love to here. Both utilize the batch nature of NMC, the Nexenta Management Console. One is the 'query' functionality, which allows various introspections on the NAS and can query across multiple appliances at once if they are grouped together (the group function). In a similar vein, there is the NMC recording facility, which is handled by the "record" command. Recording allows you to save and play back actions for various tasks, including over a network of NAS devices. All of these commands have ready examples available by invoking the command with the "-h" help argument in the console. 
There is also good stuff in the User Guide which is available for &lt;a href="http://www.nexenta.com/corp/index.php?option=com_content&amp;task=view&amp;id=50&amp;Itemid=72"&gt;download&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-260036965832390606?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/260036965832390606/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=260036965832390606' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/260036965832390606'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/260036965832390606'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2008/04/developers-developers-developers.html' title='Developers, developers, developers...'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-564655632150064659</id><published>2008-03-21T13:17:00.000-07:00</published><updated>2008-04-03T15:23:35.866-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='active directory'/><category scheme='http://www.blogger.com/atom/ns#' term='nexenta'/><category scheme='http://www.blogger.com/atom/ns#' term='cifs'/><category scheme='http://www.blogger.com/atom/ns#' term='opensolaris'/><title type='text'>Step by Step CIFS Server setup with OpenSolaris</title><content type='html'>After CIFS Server 
was released into the OpenSolaris wild, I could not for the life of me get it to work. Even in the post-B82 stage, the random collection of documentation led me astray in multiple ways. I think part of the problem is that I read up on it too much, and thus old requirements that were no longer accurate got in the way. You need to set up your krb5.conf file, right? LDAP too? The final resolution appears to be rather straightforward, and it likely shows that other steps I had taken previously were left rotting on my system and prevented a working solution.&lt;br&gt;&lt;br /&gt;So, what do you actually need? I'd recommend starting with at least B85. In my case I used the latest NexentaOS unstable release (1.0.1 to be), which includes B85 and by default the necessary Sun smb packages. For my test, I created a contrived domain using Windows 2003 Server (SP2) called WIN.NEXENTA.ORG. The rest follows:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;add to /etc/resolv.conf:&lt;br /&gt;nameserver 172.24.101.71&lt;br /&gt;domain win.nexenta.org&lt;br /&gt;search win.nexenta.org&lt;br /&gt;(Nameserver is our AD DNS server)&lt;br /&gt;&lt;br /&gt;(optional: run ntpdate against your time server)&lt;br /&gt;#svcadm enable svc:/network/ntp:default&lt;br /&gt;#svcadm enable -r smb/server&lt;br /&gt;#smbadm join -u Administrator win.nexenta.org&lt;br /&gt;&lt;br /&gt;#zfs set sharesmb=on data/myshare&lt;br /&gt;#zfs set sharesmb=name=myshare data/myshare&lt;br /&gt;&lt;br /&gt;#mkdir /data/myshare/jlittle&lt;br /&gt;#chown jlittle /data/myshare/jlittle&lt;br /&gt;&lt;br /&gt;#idmap add 'winuser:*' 'unixuser:*'&lt;br /&gt;#idmap add 'wingroup:Domain Users' 'unixgroup:staff'&lt;br /&gt;&lt;br /&gt;#svcadm restart smb/server&lt;br /&gt;#svcadm restart idmap&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Other advisable steps include "zfs set casesensitivity=mixed data/share" for correctness for Windows users, but this is likely not ideal if the shared zfs filesystem is also shared to NFS clients. 
You know it's all working if "idmap dump" gives you real values rather than just returning to the prompt. I connected to my new share via a MacOSX client, and made sure my domain matched as "win.nexenta.org" when connecting to my share (aka smb://server/myshare/jlittle).&lt;br /&gt;&lt;br /&gt;In the end, it was much simpler than the documents suggested. I had to avoid explicitly stating the domains in idmap to make idmap do the right thing. You should pick the right local group for your users in the mapping for groups. I picked "staff" as that was the default group of my user.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-564655632150064659?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/564655632150064659/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=564655632150064659' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/564655632150064659'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/564655632150064659'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2008/03/step-by-step-cifs-server-setup-with.html' title='Step by Step CIFS Server setup with OpenSolaris'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-8576520263326097769</id><published>2008-03-03T21:39:00.000-08:00</published><updated>2008-03-03T22:01:36.733-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='zfs manageability cdp pNFS lustre'/><title type='text'>Random Storage Comments, Answered</title><content type='html'>In my last posting, a lot of comments covered wide and varied ground. First, it's important to note that even with CDP underlying ZFS pools, ZFS itself provides for its own integrity of state. If CDP didn't complete a transaction, a re-sync will generally resolve it, but the actual hosted ZFS filesystem need not fear, as its transactions won't be finished until the write is checksummed. I agree that there are failure modes here, but that leads to a good quote in one of the comments:&lt;br /&gt;&lt;br /&gt;&lt;div&gt;"To that end, it seems to be that whenever a choice can be had between doing something simple to accomplish a goal and chaining a bunch of parts together to accomplish the same goal with more sophistication, its likely the simpler solution will be more sustainable over time."&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I concur. Nexenta marries two pieces of functionality to get auto-cdp, and they rely on the two components in whole to maintain overall simplicity of implementation. The real value that they have provided is in making the front end dead simple. If the management isn't simplified, any level of underlying functionality will be lost in the long run.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I want to focus more on the simplicity of the performant NAS solutions. 
With pNFS, Lustre, and the like, we know that the client becomes a bit less transparent, and the backend store of data definitely becomes somewhat opaque, as data no longer lives consistently on one server but is spread out across the whole back end. Even though you need newer clients with specific functionality in both cases, it can again be simpler than the alternative, which generally involves an NFS v3 client using automounts, LDAP-based automount maps, and heavy-handed data management on the backend to scale out in similar ways over multiple NAS heads. The tack of taking a single high-end head with best-of-breed backend hardware, such as IB interconnects to SAS disk arrays and 10Gb Ethernet out the front, might seem to work, but we have already seen pathological conditions where a single heavy client writing millions of small files can make that enhanced hardware meaningless for performance.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;There is no quick answer to scaling out both capacity and performance without a little give on each aspect of the design. What makes it all reasonable to consider is whether the entire solution is made much simpler to manage than the alternatives at either end of the design spectrum. In the end, simplicity of manageability will trump other considerations. As long as simplicity is strictly maintained in the product, the underlying complexity will seem well worth the effort. We just need to trust that someone gets the fine details. In the end, we don't mind that we can't muck much with a highly efficient, high performance car. 
As long as we feel mastery over its operation and trust in the quality of the build and service by the manufacturer, we all are willing to make the investment.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-8576520263326097769?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/8576520263326097769/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=8576520263326097769' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/8576520263326097769'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/8576520263326097769'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2008/03/random-storage-comments-answered.html' title='Random Storage Comments, Answered'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-6714845866339878940</id><published>2008-02-07T15:33:00.000-08:00</published><updated>2008-02-07T16:06:52.113-08:00</updated><title type='text'>The ZFS scaling and DR question</title><content type='html'>In my dealings with using ZFS-based NAS and 2nd tier solutions, I've been blessed to hear from different people with thoughts that push the discussion forward. The ZFS space is where a lot of long term strategies that utilized commoditized components are covered. 
Other spaces that I follow seem to be somewhat stale, or consider specialized point systems that solve problems between just two or three chosen vendors. ZFS is open, so I think its ecosystem can only grow. I'm happy to have permission from Wayne Wilson to re-state one of his emails, and use it for a discussion point.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;I see two different paths to scale:&lt;br /&gt;&lt;br /&gt;1) Use a single system to run NexentaStor and have it mount&lt;/pre&gt;&lt;pre&gt;the remaining systems via iSCSI.  This would probably have to&lt;/pre&gt;&lt;pre&gt;be done over 10 Gbs or infiniband links.&lt;/pre&gt;&lt;pre&gt;&lt;br /&gt;2) Create Multiple NexentaStor systems and only present them&lt;/pre&gt;&lt;pre&gt;as a unified file system via NFS V4.  This would leave the&lt;/pre&gt;&lt;pre&gt;CIFS clients restricted to mounting multiple volumes, but that&lt;/pre&gt;&lt;pre&gt;may be ok.&lt;/pre&gt;&lt;pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;pre&gt;Is this it or is there some other way?&lt;br /&gt;&lt;/pre&gt;&lt;pre&gt;&lt;br /&gt;Next architectural issue is how to do DR and archiving.&lt;br /&gt;&lt;br /&gt;There are two types of DR - Human initiated and machine initiated.&lt;/pre&gt;&lt;pre&gt;&lt;br /&gt; My standard strategy for the Human initiated DR is to use&lt;/pre&gt;&lt;pre&gt;snapshots and keep enough of them around to answer most restore&lt;/pre&gt;&lt;pre&gt;requests. For machine initiated, my worst case is when the&lt;/pre&gt;&lt;pre&gt;storage subsystem (either a complete Thumper or a complete array)&lt;/pre&gt;&lt;pre&gt;fails.  
For this I can find no other solution than to replicate.&lt;/pre&gt;&lt;pre&gt;&lt;br /&gt;&lt;/pre&gt;&lt;pre&gt;  As you have pointed out, replication usually locks you&lt;/pre&gt;&lt;pre&gt;into the backend storage vendors system, whereas it would be&lt;/pre&gt;&lt;pre&gt;better for us consumers to be able to 'mirror' across two&lt;/pre&gt;&lt;pre&gt;disparate storage backends.&lt;/pre&gt;&lt;pre&gt;&lt;br /&gt;Here is where things might fall apart in using Thumpers.  We&lt;/pre&gt;&lt;pre&gt;could probably spec out a really high I/O throughput 'head end'&lt;/pre&gt;&lt;pre&gt;type server to load Nexenta on.  Then we could present any kind&lt;/pre&gt;&lt;pre&gt;of storage as iSCSI LUN's to the system. Let's say we use a&lt;/pre&gt;&lt;pre&gt;Thumper 48TB system to as an iSCSI target for our head end, then&lt;/pre&gt;&lt;pre&gt;we could use an Infortrends (or some such) SATA/iSCSI array for&lt;/pre&gt;&lt;pre&gt;other LUN's and let ZFS mirror across them.......and rely on&lt;/pre&gt;&lt;pre&gt;snapshots for human based DR and mirror failure protection for&lt;/pre&gt;&lt;pre&gt;machine based DR.&lt;/pre&gt;&lt;pre&gt;&lt;br /&gt;&lt;br /&gt;Then that leaves us with Archiving.  I think that here is where&lt;/pre&gt;&lt;pre&gt;a time based tier, or at least the ability to define a tier based&lt;/pre&gt;&lt;pre&gt;on age would be useful.  If we set the age to a point beyond which&lt;/pre&gt;&lt;pre&gt;most changes are taking place (letting snapshot's take the brunt&lt;/pre&gt;&lt;pre&gt;of the change load before archiving), then it is likely that we&lt;/pre&gt;&lt;pre&gt;will have just one copy of the data to archive off, for most files.&lt;/pre&gt;&lt;pre&gt;&lt;br /&gt;What we would want to do is the make tape the archival tier.  I am&lt;/pre&gt;&lt;pre&gt;uncertain as to how to do this.  
Should it be done using vendor&lt;/pre&gt;&lt;pre&gt;backup software to allow catalogs and GUI retrievals?&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;Wayne covers how to scale out NAS heads based on ZFS, as well as the standard DR and archive question.&lt;br /&gt;&lt;br /&gt;Considering the NAS head scenario, it is plain that long term, running storage across multiple NAS heads using upcoming technologies as they mature, such as Lustre or pNFS, will be necessary to scale out with both performance and capacity. However, it is reasonable to consider solutions that utilize DAS/IB/external SAS/iSCSI to take one head node and approach petabyte levels. I consider this within reason if the target is second tier or digital archive storage, where performance isn't king. With the best of hardware, perhaps the single head node (or HA config) will have sufficient performance for most primary storage deployments. Time will tell as our needs require such solutions and the technologies we have employed improve.&lt;br /&gt;&lt;br /&gt;Disaster recovery I brought up in a recent post. Using file based backup solutions works well in the act of backing up, but restoration at a large scale is wanting, especially if file numbers become more dominant than file size. The next beta of NexentaStor happily has taken a large leap forward in addressing this by implementing a very simple to manage auto-cdp (continuous data protection) service across multiple NAS heads. This keeps multiple storage solutions in sync as data is committed, operates below the ZFS layer, and is bidirectional. Yes, the secondary system can import the same LUNs or ZFS pools and re-export them to your clients. 
Just as important, if you lose the primary host, synchronization can be reversed to restore your primary system at full block-level speeds.&lt;br /&gt;&lt;br /&gt;If you take this approach, and also consider exposing native zvols or NFS/CIFS to your clients (such as a mail server), they can keep using their local DAS storage under any OS and filesystem, while using native backup tools against the ZFS-exported volumes to regularly write block-level dumps that allow speedy block-level restores. A mixture of this and file-level backups even permits less frequent full dumps and greater granularity in recovery. In the end, you'd hope to have these wonderful features on your server OS directly to prevent having to do DR, but you can see that we are approaching reasonable answers.&lt;br /&gt;&lt;br /&gt;The final issue brought up is archival, and I hope my previous posts have gone far in answering it. In general, I believe disk based archival solutions need to be employed before tape is considered, and tape should be fully relegated to final archival stages only. Today, you can use multiple open (Amanda/Bacula) and closed backup software solutions to write to tape libraries from trailing edge snapshots. I also know that, though in its infancy, the NDMP client services evolving for ZFS will someday allow easier integration into current backup systems, allowing most people to convert existing tape based solutions completely into their last tier archive, running infrequently for long periods with just full backups.&lt;br /&gt;&lt;br /&gt;All the above is just my "it's getting better" perspective. Perhaps you can find some glaring weakness. I hope shortly you can all see the auto-cdp service that Nexenta has put together in action. 
It's well worth the wait.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-6714845866339878940?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/6714845866339878940/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=6714845866339878940' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/6714845866339878940'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/6714845866339878940'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2008/02/zfs-scaling-and-dr-question.html' title='The ZFS scaling and DR question'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-5953816670279063804</id><published>2008-01-17T14:10:00.000-08:00</published><updated>2008-01-17T14:45:20.206-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='iram'/><category scheme='http://www.blogger.com/atom/ns#' term='slog'/><category scheme='http://www.blogger.com/atom/ns#' term='log'/><category scheme='http://www.blogger.com/atom/ns#' term='zfs'/><category scheme='http://www.blogger.com/atom/ns#' term='nfs'/><category scheme='http://www.blogger.com/atom/ns#' term='nexentastor'/><category scheme='http://www.blogger.com/atom/ns#' term='iscsi'/><title type='text'>Using the iRam: Improving ZFS perceived transaction 
latency</title><content type='html'>I've been long overdue in reviewing the Gigabyte iRam card and its effect on the performance of your favorite ZFS NAS product. NexentaStor already supports log devices, so the time appeared right to get one of these for a client I consult with to help deal with the noticeable pauses one can see when heavy reads and writes compete on a ZFS pool. I hope that the single threaded nature of those commits is resolved at some future point, but the iRam card appears to be a simple way to inject an NVRAM-like device into your commodity NAS solution.&lt;br /&gt;&lt;br /&gt;The card itself is simply four DIMM sockets for DDR RAM, with a battery backup, reset switch, power driven from a PCI bus, and a single SATA-I connection to plug the unit into your existing SATA interfaces. Already you can see that the performance limit is 150MB/sec based on the SATA-I spec. What does this card do though? Near instant reads and writes in a safe battery-backed ramdisk that your system sees as a 2GB or 4GB drive, just what you'd want for a dedicated write commit device. In the case of many spindles in an array, you likely can do better than this device for raw throughput, but in the case of many small commits, the near perfect latency of RAM is far better suited to keeping writes happening without stalling the drives for reads. Since it's a "slog" device in ZFS terms, it will regularly commit to the real underlying storage at full disk bandwidth. Therefore, even when writes must compete with reads on the physical disk, you limit your exposure to perceived stalls in I/O requests even in the higher load cases.&lt;br /&gt;&lt;br /&gt;For my non-production test, I actually put together the worst-case scenario: an iSCSI backed ZFS array with NFS clients and many small files. In this case, any NFS write requires 3 fsyncs on the back end storage, as required by NFS (create, modify, close). This is actually similar to CAD libraries, which the test was made to reflect. 
Using iSCSI devices, you can inflict much higher latencies. My iSCSI targets are actually older SATA-I drives themselves on an SBEi Linux based target using 3ware 8500s. Again, nowhere near ideal.&lt;br /&gt;&lt;br /&gt;Creating a directory of 5000 small 8k files, I copied this from a Linux gig-e connected client to a ZFS pool (made of two non-striped iSCSI LUNs), and got a meager 200K/sec write performance over NFS. When I striped the data in the ZFS pool instead, the numbers increased to 600K/sec at some points. Adding a 2GB Gigabyte iRam drive, I increased those numbers up to 9MB/sec, averaging around 5MB/sec overall. That's at least 10 times the performance. Again, this test involves many I/O operations rather than much bandwidth.&lt;br /&gt;&lt;br /&gt;How fast can data be written to and read from that log device? My tests showed that 100MB/sec for reads and writes was common, with writes only bursting to those numbers for larger streaming data sets. In the case of the iSCSI nodes in question, each one could be pulled at a top rate of 45MB/sec, but averaged closer to 27MB/sec. Nominally, you can see that we are 3x better than at least these gig-e iSCSI devices.&lt;br /&gt;&lt;br /&gt;The final production installation of the iRam device was with a SATA-II DAS array, and even in heavier load scenarios, we saw the wait cycle for write commits to the drives limited, and a steady 100+MB/sec use of the commit log (reads and writes). The only caveat for using such a device is that the current builds of OpenSolaris and thus NexentaStor do not allow you to remove it once added to a pool. 
A future release is supposed to address that.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-5953816670279063804?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/5953816670279063804/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=5953816670279063804' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/5953816670279063804'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/5953816670279063804'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2008/01/using-iram-improving-zfs-perceived.html' title='Using the iRam: Improving ZFS perceived transaction latency'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-6097848112223487987</id><published>2008-01-11T14:29:00.001-08:00</published><updated>2008-04-03T15:24:20.827-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='lustre'/><category scheme='http://www.blogger.com/atom/ns#' term='zfs'/><category scheme='http://www.blogger.com/atom/ns#' term='opensolaris'/><category scheme='http://www.blogger.com/atom/ns#' term='disaster recovery'/><category scheme='http://www.blogger.com/atom/ns#' term='iscsi'/><category scheme='http://www.blogger.com/atom/ns#' term='block device'/><title type='text'>Swept Under the Rug</title><content type='html'>In our day to day management of 
technology, we tend to pick paths that resolve the most pressing pain points. Inadvertently, we often also sweep certain problems under the rug, awaiting the day when it all must be cleaned up. Many choices do exactly this, solving the present problem while creating perhaps larger problems down the road. In my evolving strategy on storage, the move away from tape to disk-based online storage solves the most obvious problems but creates a whole series of other problems, including file based disaster recovery, long term maintenance of the underlying disk technology, true long term persistence of data, and general accessibility of the data by future technology. Today, I'll focus only on our next major pain point, disaster recovery.&lt;br /&gt;&lt;br /&gt;Recently, a few instances occurred that underscored the need for better thought out solutions than what we already put in place. We thought we might be ahead of the curve with tiered copies of data on secondary NAS solutions, with our backup windows well within reason. It's obvious we made the right choice in doing incremental file based backups to secondary NAS, as the end data containers are universal across network file protocols. Recovery of any given file or perhaps even full data store recovery still beats that of tape libraries multiple times over. However, the architecture in place has allowed us to scale from the gigabyte world to the terabyte world. Our backup windows are well in hand, and spot recovery is a cinch. But there are some problematic disaster recovery scenarios.&lt;br /&gt;&lt;br /&gt;The first scenario was just felt a week ago. A mere 50GB file store of Maildir formatted mail, where each message is a file itself, with mail folders represented by directories, had write errors on its underlying Linux XFS volume. This is far from our largest such install. Various mail servers for separate organizations we deal with are over 500GB in size. 
We suspect the RAID card's NVRAM was toast, disallowing further writes, and we had to migrate the mail to another server quickly. Simple enough, let's recover from our second tier mail store, right? The attempt was made, but we found ourselves limited not by the reading of millions of small 1K files so much as recommitting those files onto a journaled filesystem. The metadata updates of the files alone were bad enough. In the end, we were limited by file operations per second, and not pure bandwidth to the disk. Our estimated time of recovery was a minimum of 14 hours, and only for 50GB. A clue to the long term solution to this was in how we restored everything in less than 2 hours. In this case, we relied upon an xfsdump from the read-only failing array to a new filesystem on the spare hardware. &lt;br /&gt;&lt;br /&gt;The obvious up front answer to disaster recovery of data in a multi-terabyte world is to make sure you have copies of everything in as close to a high availability setup as you can afford. If the underlying RAID array was actually two arrays, with software mirroring across the two, or if it were two separate machines that either attached to shared mirrored arrays or otherwise mirrored their underlying RAID arrays over the network, we'd all just worry about natural disasters. Preventing the true disaster recovery scenario up front is the only true way to win, but most of us simply don't have the luxury, the resources, or the ability to safely migrate the myriad of production or otherwise in use solutions over to the ideal configuration. We can all try to reach this nirvana, but it's simply not as attainable to most of us as we'd like.&lt;br /&gt;&lt;br /&gt;We can, however, address some of the pain of the disaster recovery scenario from disk based solutions. The iSCSI and SAN vendors have been on this for some time, and have extolled the virtues of block based storage. Using such, you can stream I/O at near the theoretical limits of the hardware. 
However, running all your systems against a SAN throws you down the path of the usual hardware based solution to a general problem, with the usual vendor lock-in quibbles. We already have bought into the software based approach that NexentaStor has offered us, and happily, they already provide a similar solution to our needs. With thin-provisioning of virtualized storage volumes (zvols), one can expose block level storage to clients but still treat the volumes as snapshot capable files, and use file-level services and the like on the back-end second-tier NAS. The clients will generally access these through iSCSI, and they can either directly depend on these network-based volumes as if they were local filesystems, or simply use their filesystem native dump programs to periodically maintain a near synchronized copy of a true DAS filesystem to a second tier block level copy. The latter is nice as it doesn't place undue strain on the back end storage architecture to service all clients in parallel at the fullest of performance for production. We just use network and storage resources for backup.&lt;br /&gt;&lt;br /&gt;What does this solve? In the case of disaster recovery, the backup process can be run in reverse, getting streaming I/O rates, perhaps as high as 100+MB/sec over gigabit ethernet, when the local arrays fail. In the case of my mail spool filesystem, we recovered at rates of 25-30MB/sec instead of the 500-800K/sec we saw with file-level recovery. Even if it's not the most up-to-date copy, if you also do file-level backups of the underlying file system to NexentaStor or the like at a more frequent interval, you can recover from those incrementally after the first block level recovery. Either way, you taste the sweetness of success. Again, the dirt is stuffed under the rug if this is hundreds of terabytes, and some day soon that may also be just as commonplace, but perhaps we are again ahead of the curve. 
The one side effect is that you'll want more cheap storage readily available on the second tier.&lt;br /&gt;&lt;br /&gt;I'll quickly describe the second scenario, where a failing system also needed its 1TB of mainly larger files migrated. We saw that our top rates of file level recovery at best were 1GB/minute, but generally less. Again, it would have made sense to have been redundant up front, but the same solution above could more than double the rate of recovery if we could restore the primary file system at block speeds. This is similar to how virtual machines are managed from SAN, iSCSI, or even NFS. The VMs themselves are represented as files, and so operations on these files approach the maximum speed of block storage operations. However, having them on that NAS allows ease of sharing and management, including snapshots. No hardware tricks, all software. We haven't addressed the next stumbling blocks, which include kernel page size limitations on true file I/O, but the dirt is nicely hidden for the time being.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-6097848112223487987?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/6097848112223487987/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=6097848112223487987' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/6097848112223487987'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/6097848112223487987'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2008/01/swept-under-rug.html' title='Swept Under the Rug'/><author><name>jmlittle@gmail.com (Joe 
Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-8359151472018963652</id><published>2007-11-02T12:05:00.000-07:00</published><updated>2008-04-03T15:23:56.515-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='lustre'/><category scheme='http://www.blogger.com/atom/ns#' term='zfs'/><category scheme='http://www.blogger.com/atom/ns#' term='opensolaris'/><title type='text'>Old is new again</title><content type='html'>Prior to my happiness in using all things ZFS, I was eagerly testing Lustre. My primary concerns there were both management of the backend storage and the ability to translate this network file system to standard NFS and CIFS for non-Linux clients. Sun recently announced that it would acquire Cluster FileSystems, but prior to that, they made announcements about how Lustre and ZFS would be married. I always thought that perhaps ZFS would be the exposed layer on a client, with the same old OSTs and OSDs on the backend. Not so.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://insidehpc.com/2007/07/14/cfs-moves-lustre-to-zfs/"&gt;CFS Moves Lustre to ZFS&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Instead, ZFS is the backend to the storage nodes. Although the driver pool is somewhat smaller for OpenSolaris than for Linux, it does make one wonder exactly what new flexibility is afforded by this arrangement. Also, how much of this will end up as open sourced code to be incorporated into a highly manageable product, such as my new favorite appliance, &lt;a href="http://www.nexenta.com/"&gt;NexentaStor&lt;/a&gt;? I'll be diving further into this as more details emerge. 
Perhaps others can chime in with more info or clarification.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-8359151472018963652?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/8359151472018963652/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=8359151472018963652' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/8359151472018963652'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/8359151472018963652'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2007/11/old-is-new-again.html' title='Old is new again'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-530967970785297486</id><published>2007-11-01T08:08:00.000-07:00</published><updated>2008-04-03T15:25:00.030-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='nexenta'/><category scheme='http://www.blogger.com/atom/ns#' term='tiering'/><category scheme='http://www.blogger.com/atom/ns#' term='nas'/><category scheme='http://www.blogger.com/atom/ns#' term='nexentastor'/><title type='text'>The Coming Out Party for Commodity Storage</title><content type='html'>If you have been following along, I remarked in &lt;a 
href="http://jmlittle.blogspot.com/2007/09/multi-tier-storage-revisited.html"&gt;http://jmlittle.blogspot.com/2007/09/multi-tier-storage-revisited.html&lt;/a&gt; that "the increasing capabilities of Nexenta's storage solution and its underlying OpenSolaris base have proceeded a pace, and I believe the future is very bright for this solution". It's one of the few bright spots that I've had the privilege of using to enable commodity-based storage solutions. I've been an early adopter of the NexentaStor multi-tier storage appliance, and I am happy to hear that not only is it approaching its first general release to customers, but a release candidate is being made available to the public. Although I run it directly on hardware, the VMware evaluation version of the product has been deemed fit enough for people to kick the tires and see exactly how this fits in the organization. Check out &lt;a href="http://www.nexenta.com"&gt;http://www.nexenta.com&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Many will ask how this is different from the hardware based NAS and disk-to-disk solutions, and others will wonder how this compares to the FreeBSD and Linux based solutions and projects already on the market. It comes down to what it does best now, and the potential of where it will go in the future. If you haven't been catching the storage news lately, NexentaStor is the first major product being built on the ZFS filesystem, which brings to commodity storage much of what has till now only been accessible through the hardware vendors. It's that secret sauce that has justified those large margins and high priced "vendor-provided and tested" disk drives. What if you could just build it out on your own? Many open source solutions supposedly allow for just that, but it's somewhat beyond a do-it-yourself level: the pieces aren't necessarily all integrated, nor is the complete solution truly comparable to commercial solutions or production ready. 
The test is: would you feel safe having 50TB of your backups on that solution?&lt;br /&gt;&lt;br /&gt;ZFS is all fine and good, but it's the integration I speak of that has made me settle on this particular product. It also brings a fully developed commercial grade NFSv4 server solution, fully managed snapshots with the necessary scheduling, multiple replication and tiering services to integrate it anywhere in my digital archive flow, virtualized and thin provisioning, iSCSI target and client support of said storage, and when installed on brawnier hardware architected to grow, it will quickly eclipse many heavily marketed primary storage solutions, at a true fraction of the cost.&lt;br /&gt;&lt;br /&gt;Nexenta is building this on OpenSolaris and their own hybrid opensolaris/debian-style distribution. It has just started to stretch its legs when it comes to potential. However, our use is in second-tier storage, and that truly is where it shines right now. We've already thrown 50TB of disk at this via SCSI, iSCSI, SATA, and the like. It enables reuse of the storage you have now for a credible tiering architecture, and it's both the web based interface and extensive command line interface that allow both legacy and new storage components to be managed. I could go to a zettabyte of storage with unlimited snapshots with the current installation, but one would undoubtedly want a more thought out long term hardware architecture. At least the current design allows for phasing in new technology while phasing out the old in the same pools I use today. Long term, I have high hopes that the product further simplifies data growth and management of a multitude of devices.&lt;br /&gt;&lt;br /&gt;Now that this is finally available for public consumption, I'll be able to speak more and provide good best practice advice. 
Here is some ready advice to keep in mind:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;1)&lt;/span&gt; As per-disk capacity grows while prices drop, the exposure window while rebuilding any lost disk makes it clearer that RAID10 provides the best of all worlds for volume growth, redundancy, and recoverability.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;2)&lt;/span&gt; Don't throw away your primary storage. It's still a mature product, and NexentaStor is best suited to secondary storage at this time. Long term, you can migrate that primary storage into the second tier, managed by NexentaStor. Once you are familiar and comfortable with commodity based storage solutions, you'll find it moving to primary storage environments when it's good and ready.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;3)&lt;/span&gt; That all said, commodity based storage solutions are now here. The wait is over; jump on in today.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-530967970785297486?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/530967970785297486/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=530967970785297486' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/530967970785297486'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/530967970785297486'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2007/11/coming-out-party-for-commodity-storage.html' title='The Coming Out Party for Commodity Storage'/><author><name>jmlittle@gmail.com (Joe 
Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-6023509265444569692</id><published>2007-10-12T15:15:00.000-07:00</published><updated>2007-10-12T15:31:27.570-07:00</updated><title type='text'>Stanford's multi-tier solution: an Example</title><content type='html'>&lt;a href="http://1.bp.blogspot.com/_8PsagR9Aa2c/Rw_yu8SMcxI/AAAAAAAAABk/Pq9iQth-LmE/s1600-h/stanford-multitier.jpg"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;" src="http://1.bp.blogspot.com/_8PsagR9Aa2c/Rw_yu8SMcxI/AAAAAAAAABk/Pq9iQth-LmE/s320/stanford-multitier.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5120578189613888274" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;In my last post covering our year long foray into multi-tier storage, I promised I would detail the specific configuration of systems used at Stanford in production. Prominently featured are a NetApp FAS3050 (28TB raw, standard dual parity RAID4) used as the primary storage, as well as a Sun X4500 utilizing NexentaStor (24TB raw, RAIDZ). We have a secondary NexentaStor head for location independence as well as further expansion with both SCSI and iSCSI attached storage, adding about another 26TB. What's missing from the picture are two important facts. First, other general purpose file servers are also being tiered to the second head. Second, one of them is a 4TB NAS unit based on the same NexentaStor product. The various capabilities of Nexenta's product allow it to perform well as primary storage, but in the commodity hardware realm, you currently lack some features that the most discerning of storage customers will still find in most integrated hardware solutions. 
Time alone will tell where the proper mix of hardware and Nexenta's solution ends up.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-6023509265444569692?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/6023509265444569692/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=6023509265444569692' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/6023509265444569692'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/6023509265444569692'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2007/10/stanfords-multi-tier-solution-example.html' title='Stanford&apos;s multi-tier solution: an Example'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_8PsagR9Aa2c/Rw_yu8SMcxI/AAAAAAAAABk/Pq9iQth-LmE/s72-c/stanford-multitier.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-8060019300931823296</id><published>2007-09-28T20:05:00.000-07:00</published><updated>2007-09-28T20:43:05.584-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='nexenta'/><category scheme='http://www.blogger.com/atom/ns#' term='multi-tier storage'/><category scheme='http://www.blogger.com/atom/ns#' term='tape'/><title type='text'>Multi-Tier Storage revisited</title><content type='html'>It's 
been a year since I posted here regarding Stanford's current and planned use of various storage solutions to virtually eliminate tape-based nightly archiving. Since then, the industry has gone through various changes, and in some cases, not much change at all.&lt;br /&gt;&lt;br /&gt;Specifically, part of our solution used NeoPath's FileDirector for file based virtualization, and SBEI's iSCSI target solution for our backend storage. In the middle, pulling data from our primary NetApp fileservers, was a burgeoning solution from Nexenta being BETA tested at Stanford. So what has changed? NeoPath was acquired by Cisco, with the current product in production ceasing to be supported. SBEI has been acquired by NeoNode, and their iSCSI target, best in class for enabling commodity storage, isn't getting much love. How has Nexenta fared? While we will likely need to migrate away from the other solutions, the increasing capabilities of Nexenta's storage solution and its underlying OpenSolaris base have proceeded apace, and I believe the future is very bright for this solution.&lt;br /&gt;&lt;br /&gt;The NexentaStor product, in early BETA, delivers today on providing a snapshot based large scale file system, utilizing underlying storage pools (iSCSI, SCSI, SATA, FC, etc) and a well developed services architecture including data synchronization and replication, multi-host data tiering, and other facilities with data retention and disaster management to boot. Its base system disks even have bulletproof checkpointing, reversion, safe updating, and redundancy, all in a software solution. The future? Well, it's easy to see that, with upcoming NFS v4.1 support, the product can tackle the name space virtualization one has found in products such as NeoPath. 
Already it can repurpose snapshot-based raw volumes as iSCSI targets, so if the underlying hardware is supported by OpenSolaris, you have an easily managed enterprise-feature level iSCSI target product.&lt;br /&gt;&lt;br /&gt;Stanford has over a year's worth of second tier data, in both 60 daily and 12 monthly snapshots, tiered from our NetApp. These are within many separate folders, representing over a thousand snapshots per volume. We've recently adopted the Sun X4500 24TB product and have migrated to this ideal solution for quicker disaster recovery. The read speeds on this 48 drive unit are great, and the price point rivals what we've built with iSCSI. Commodity storage is commodity storage, but we continue to utilize iSCSI, DAS (SATA-to-SCSI units), and other additional units to eclipse 48TB of secondary storage. We have also utilized this solution for one organization as both first and second tier storage, an additional 16TB when we consider their solution, and it has proven its worth both in day to day NAS use as well as some data recovery and full disaster recovery modes.&lt;br /&gt;&lt;br /&gt;Now that Nexenta supports some backup software as well as a client, we've only backed up directly to tape from the second tier once per year. We've let our LTO-2 tape library run continuously for around a week just to give us a full archived edition of our data. Are we missing tape reuse, tape-based recovery, or multiple library scheduling (and rescheduling) just to meet an ever growing nightly backup window? Nope. Nexenta looks to be here to stay.&lt;br /&gt;&lt;br /&gt;I'll follow up later with specific details on configuration, where I hope things will go, and other random thoughts. 
On this anniversary, it would appear commodity-based multi-tier storage is practical and readily available.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-8060019300931823296?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/8060019300931823296/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=8060019300931823296' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/8060019300931823296'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/8060019300931823296'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2007/09/multi-tier-storage-revisited.html' title='Multi-Tier Storage revisited'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-231773127594684121</id><published>2007-05-12T11:38:00.000-07:00</published><updated>2007-05-12T11:47:14.806-07:00</updated><title type='text'>CommunityOne</title><content type='html'>I had the pleasure of presenting the case for Nexenta, an OpenSolaris distribution that combines the best of Solaris with what is best in an Ubuntu/Debian distribution, at CommunityOne last week. My slides are now available &lt;a href="http://winterfell.stanford.edu/jlittle/CommunityOne_2007-Nexenta.pdf"&gt;online&lt;/a&gt;. 
The demos actually reference Martin Man's great flash demos that he posted at &lt;a href="http://martinman.net/software/nexenta"&gt;his site&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-231773127594684121?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/231773127594684121/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=231773127594684121' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/231773127594684121'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/231773127594684121'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2007/05/communityone.html' title='CommunityOne'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-115942263384023954</id><published>2006-09-27T22:44:00.000-07:00</published><updated>2007-01-06T11:03:26.719-08:00</updated><title type='text'>Multi-Tier Storage -- The Commodity Approach</title><content type='html'>I've been working on some internal documentation explaining some of our long term plans regarding storage. Initially, I imagined two documents, one an executive overview, and another a complete documentation set. Well, as things are hard to write up the first time and maintain, I decided to make one document, moving all the technical and site specific details to appendices. 
The end result is that I can now post the primary document sans appendices and make it public. &lt;br /&gt;&lt;br /&gt;This is the culmination of a multi-year project to move to reliable commodity based storage and get away from nightly tape backup scenarios that do not scale with today's storage growth. So go ahead, and check out &lt;a href=http://winterfell.stanford.edu/jlittle/multitier-whitepaper.pdf&gt;my multi-tier storage whitepaper&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-115942263384023954?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/115942263384023954/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=115942263384023954' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/115942263384023954'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/115942263384023954'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2006/09/multi-tier-storage-commodity-approach.html' title='Multi-Tier Storage -- The Commodity Approach'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-115302910861126213</id><published>2006-07-15T22:47:00.000-07:00</published><updated>2006-07-15T23:03:27.240-07:00</updated><title type='text'>Converting LDAP netgroup entries back to flat file format</title><content type='html'>I am surprised 
that nowhere on the net has anyone posted how to convert a netgroup objectclass back to flat file format. This is important for loading this dynamic data back into systems that themselves rely on static files. You'll need openldap-clients or similar packages (to get ldapsearch). Also, the script below expects anonymous read access and, obviously, no SASL auth. Finally, the "grep net" part of the netgrouplist is there to grab only netgroup names with "net" in them, which is what we have standardized on.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 0);font-size:78%;" &gt;&lt;span style="font-family:times new roman;"&gt;&lt;br /&gt;#!/bin/bash&lt;br /&gt;BASE="dc=example,dc=com"&lt;br /&gt;HOST="ldap.example.com"&lt;br /&gt;&lt;br /&gt;# Collect the names of all netgroups that contain "net"&lt;br /&gt;netgrouplist=`ldapsearch -x -b "$BASE" -h "$HOST" objectclass=nisnetgroup cn | grep cn: | grep net | awk '{print $2}'`&lt;br /&gt;&lt;br /&gt;# Print each netgroup and its triples as one backslash-continued entry&lt;br /&gt;for i in $netgrouplist&lt;br /&gt;do&lt;br /&gt;echo "$i \\"&lt;br /&gt;ldapsearch -x -b "$BASE" -h "$HOST" cn="$i" &amp;gt; /tmp/netgrp.$$&lt;br /&gt;cat /tmp/netgrp.$$ | grep nisNetgroupTriple | awk -F' ' '{print $2}' &amp;gt; /tmp/netgrp-hosts.$$&lt;br /&gt;lastentry=`tail -1 /tmp/netgrp-hosts.$$`&lt;br /&gt;for j in `cat /tmp/netgrp-hosts.$$` &lt;br /&gt;  do&lt;br /&gt;    if [ "$j" == "$lastentry" ]&lt;br /&gt;      then echo -e "\t $j"&lt;br /&gt;    else &lt;br /&gt;      echo -e "\t $j \\"&lt;br /&gt;    fi&lt;br /&gt;  done&lt;br /&gt;rm /tmp/netgrp-hosts.$$ /tmp/netgrp.$$&lt;br /&gt;done&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-115302910861126213?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/115302910861126213/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=115302910861126213' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/115302910861126213'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/115302910861126213'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2006/07/converting-ldap-netgroup-entries-back.html' title='Converting LDAP netgroup entries back to flat file format'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-112071586113039005</id><published>2005-07-06T21:56:00.000-07:00</published><updated>2005-07-06T23:21:14.913-07:00</updated><title type='text'>Configuring NeoPath for multi-tiered storage</title><content type='html'>&lt;span style="font-weight: bold;"&gt;The NeoPath Solution&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;We have embarked on a path whereby data is no longer incrementally backed up to tape, but rather backed up to a second tier of disk space. The reasoning is that tape technology is no longer able to keep up with disk technology and pricing. As our primary storage grows year over year, a long term strategy for backing it up is required. Tape is still used, but it is now relegated to archival purposes only, and at long intervals at that.&lt;br /&gt;&lt;br /&gt;There are various 1st tier NAS solutions we are using, from NetApp to commodity storage with and without their own snapshot technologies. However, in most cases, snapshots are available for fine-grained file level backups that cover hourly changes or at least multiple images of the filesystem per day. 
However, each solution tends to have its own snapshot directories, and not all are relative to each directory in a volume. In some cases, snapshots are only available at the root of a volume.&lt;br /&gt;&lt;br /&gt;Our 2nd tier solutions are even more commodity, consisting of standard journaling filesystems on large volumes, each representing at least double the capacity of its matching first tier. We provide backups as hard-link style snapshots using &lt;a href="http://www.rsnapshot.org/"&gt;rsnapshot&lt;/a&gt;. These pull an initial copy from the first tier either with rsync or via an NFS mount and periodically (daily) copy over deltas into new dated directories, preserving untouched files with hard links. 2nd tier storage is coarser, representing daily snapshots over multiple months of incremental diffs. At regular intervals (6 months on average), a 2nd tier snapshot is used as the source of a tape archival. Again, these snapshots tend to differ from the first tier in their layout and directory structure. More importantly, the 2nd tier systems are not directly accessible by the end users.&lt;br /&gt;&lt;br /&gt;Finally, the multi-tier solution breaks out storage onto multiple distinct servers. How do end users know where to get their data, and how can they acquire self-serve restorations? The solution we have found is &lt;a href="http://www.neopathnetworks.com/"&gt;NeoPath&lt;/a&gt;. This product acts as an NFS or CIFS aggregator, allowing new logical paths to be made to consolidate storage into a single logical tree if necessary. It also provides for live data migration between servers, so it protects one's continuing investment in 1st and 2nd tier storage solutions, allowing for the acquisition of new 1st tier storage and migration of older storage to the backend or out of service entirely. The migration concept can be a critical feature when failing hardware needs to be replaced and it's impossible to disentangle a system from centralized storage services. 
Other features include defining virtual servers, synthetic directory trees, and synthetic links and unions formed from back end mounted file servers.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Minor Quibbles&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;For all of its advantages, the NeoPath product still has its faults. Primarily, design decisions made to do the right thing can at first get in the way of a basic implementation. The first problem we noticed is that NeoPath on its face requires each backend share to be read-write, since it needs to store metadata about that share on each back end file server (within a hidden directory). In the case of snapshot file systems directly exported, you are only given a read-only system by design of the NAS. Second, even in the case of our 2nd tier hard-link level file snapshots, we do not wish to expose that filesystem to the outside as read-write. When you re-export multiple tiers into a single tree, clients are generally given read-write permissions to their primary directories, and it's only possible to enforce read-only permissions on parts of that tree if the NeoPath itself is reading it read-only.&lt;br /&gt;&lt;br /&gt;The other problem is that although clients can mount at any permissible point in a share, the NeoPath product itself only permits inclusion of back end directories into its virtual trees via the explicitly defined exports of the back end file servers. You cannot directly reference a deeper directory in the formation of unions or synthetic links. It also will only honor the first mount point it encounters when traversing a back end file server. 
Thus, you cannot get around the issue by defining multiple levels of exported points to mount from.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;All Problems Have Solutions&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;We have successfully resolved these issues with a little guidance from NeoPath. Over time, I hope to refine and further explain these solutions.&lt;br /&gt;&lt;br /&gt;First, whereas the web-based GUI does not let you explicitly define certain options on accessing back end servers, the command line environment does. You need to consult the command line reference guide, but the gist is that to allow for read-only volumes, one can define alternative dstorage locations (NeoPath speak). What we did was define a small read-write share from a NAS that was 64MB in size. That was the smallest available volume offered by our NAS, so you can likely go smaller. This volume will serve as metadata storage for all file servers if you instruct the NeoPath product to do so. Now, we can pull in any type of share regardless of write permissions.&lt;br /&gt;&lt;br /&gt;The second problem has a similar but more involved solution. One can get past the issue of deep linking on back end servers by creating another minimal volume where symbolic links will be created. The idea is that when creating a synthetic directory, one should mount all back end file servers in a uniform path. The primary paths in the synthetic directory that users will see should be created in the small volume, utilizing relative symbolic link references to the uniform paths including any necessary deep references. An example would be useful here. 
Take this directory structure:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 51, 51);font-family:courier new;font-size:78%;"  &gt;/myorg/users : (/myorg/users/jlittle -&gt; /myorg/tier/1/vol7/users/jlittle)&lt;br /&gt;/myorg/backup/tier1 : (/myorg/backup/tier1/users/jlittle -&gt; /myorg/tier/1/vol7/.snap)&lt;br /&gt;/myorg/backup/tier2 : (/myorg/backup/tier2/users/jlittle -&gt; /myorg/tier/2/vol7)&lt;br /&gt;/myorg/tier/1 : (contains vol1 through vol7 mounts of tier 1 system)&lt;/span&gt;&lt;span style="color: rgb(51, 51, 51);font-size:78%;" &gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: rgb(51, 51, 51);font-family:courier new;font-size:78%;"  &gt;/myorg/tier/2 : (contains vol1 through vol7 mounts via a union of tier 2 systems)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;In the above example, the &lt;span style="font-style: italic;"&gt;users&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;backup &lt;/span&gt;trees are synthetic links to two small volumes defined on a NAS for the purpose of generating symbolic link trees. The last two lines are direct synthetic links to back end 1st and 2nd tier storage. The various backup points are actually links to the head of each snapshot volume, as the user first needs to traverse into a date-labeled directory before proceeding in &lt;span style="font-style: italic;"&gt;/users/username&lt;/span&gt; or the equivalent. 
To generate the symbolic link tree, I took output from file listings that show actual relative paths per user (eg: &lt;span style="font-style: italic;"&gt;../vol7/users/jlittle&lt;/span&gt;) and built a little script to be run from a system mounting the base tree of /myorg.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 0);font-size:78%;" &gt;&lt;span style="font-family:times new roman;"&gt;#!/bin/bash&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:times new roman;"&gt;MNTDIR=/mnt&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:times new roman;"&gt;SRCFILE=/root/users-lists&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:times new roman;"&gt;for LINE in `cat $SRCFILE`&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:times new roman;"&gt;do&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:times new roman;"&gt;  VOL=`echo $LINE | awk -F'/' '{ print $2 }'`&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:times new roman;"&gt;  USER=`echo $LINE | awk -F'/' '{ print $4 }'`&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:times new roman;"&gt;  echo $VOL $USER&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:times new roman;"&gt;cd $MNTDIR/users&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:times new roman;"&gt;ln -sf ../tier/1/$VOL/users/$USER $USER&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:times new roman;"&gt;cd ../backup/tier1/users/&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:times new roman;"&gt;ln -sf ../../../tier/1/$VOL/.snap $USER&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:times new roman;"&gt;cd ../../tier2/users/&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:times new roman;"&gt;ln -sf ../../../tier/2/$VOL $USER&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:times new roman;"&gt;done&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The only issue left is to provide easy maintenance of this link list as users move around. 
It's an exercise left to administrators to tie this into their account creation and migration scripts/processes.&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-112071586113039005?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/112071586113039005/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=112071586113039005' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/112071586113039005'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/112071586113039005'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2005/07/configuring-neopath-for-multi-tiered.html' title='Configuring NeoPath for multi-tiered storage'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-14093712.post-112016839443522812</id><published>2005-06-30T14:50:00.000-07:00</published><updated>2005-07-06T23:17:32.543-07:00</updated><title type='text'>What I may say...</title><content type='html'>This is a first blog entry, and as such, it will serve as an introduction to what may show up here. I've played with various blog clients (iBlog is great!), wikis, CMSes, etc. 
In the end, I want consistency in what I use, and the use is pretty erratic. I mostly need a place to build documentation for the various complete, semi-complete, or planning stage projects that I'm always doing in parallel. I find my attempts at documentation wanting. Therefore, any consistent blog approach may in fact aid in these efforts.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/14093712-112016839443522812?l=jmlittle.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://jmlittle.blogspot.com/feeds/112016839443522812/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=14093712&amp;postID=112016839443522812' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/112016839443522812'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/14093712/posts/default/112016839443522812'/><link rel='alternate' type='text/html' href='http://jmlittle.blogspot.com/2005/06/what-i-may-say.html' title='What I may say...'/><author><name>jmlittle@gmail.com (Joe Little)</name><uri>http://www.blogger.com/profile/09731419203596760536</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
