Little Notes

Wednesday, May 29, 2013

Sorting Derived (Transient) Domain Properties in Grails

Every time someone on the Grails list or StackOverflow remarks that sorting only works in GORM with database-persisted properties, I'm left scratching my head as to why results can't be easily sorted after they are faulted in from the database. Groovy actually enables this with its own List.sort() mechanism, which I've now posted an example of in the Derived Sorts project on GitHub.

The key to the sorting is to know which fields can be sorted without custom code, and then, based on a stringent selection of sortable properties, one can post-sort the results, including secondary sorts. It gets rather complicated quickly when you consider multiple sort criteria. Here's a direct link to the controller code. Any hints to clean it up further for future readers are welcome. The code now shows working pagination. I used the solution for PaginatebleList that Colin Harrington posted.
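For readers who just want the shape of the idea without opening the project, here is a minimal sketch in Python rather than Groovy. It is not the controller code from the Derived Sorts project; the Person class, its fields, the SORTABLE whitelist, and sort_and_page are all hypothetical names made up for illustration. It shows a whitelist of sortable properties, a multi-key in-memory sort that includes a derived property, and a pagination slice taken after the sort.

```python
from dataclasses import dataclass
from operator import attrgetter


@dataclass
class Person:
    first_name: str  # persisted column (hypothetical)
    last_name: str   # persisted column (hypothetical)

    @property
    def full_name(self) -> str:
        # Derived (transient) value: it is never stored, so the database cannot order by it.
        return f"{self.last_name}, {self.first_name}"


# Stringent selection of properties that callers are allowed to sort on (assumed list).
SORTABLE = {"first_name", "last_name", "full_name"}


def sort_and_page(rows, sort_keys, offset=0, max_rows=10):
    """Sort fetched rows in memory by (property, descending) pairs, then return one page and the total."""
    for name, _ in sort_keys:
        if name not in SORTABLE:
            raise ValueError(f"not a sortable property: {name}")
    result = list(rows)
    # Python's sort is stable, so sorting by the least significant key first
    # and the primary key last produces a correct multi-key ordering.
    for name, descending in reversed(sort_keys):
        result.sort(key=attrgetter(name), reverse=descending)
    return result[offset:offset + max_rows], len(result)
```

The page is sliced only after the full sort, which is why this approach needs every candidate row in memory; that is the price of sorting on values the database never sees.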

Dtrace training -- it's worth it!

While it's fresh in my mind, I wanted to cover the wonderful Dtrace training I had with Max Bruning and Brendan Gregg. The three-day course is occasionally offered by Joyent, and here is an example synopsis. Not only do you get the meaty book on the subject, but the workbook and the labs you do are challenging. Sadly, I came in a not too distant second on the 'ole leader board for solving the most labs.

What is more interesting was how quickly I found myself using the tool after the class. Although I heavily utilize KVM instances in Joyent's SDC, I got a request the following day as students were approaching "tape out" for their chips. Various students weren't cleaning up after themselves, with simulations running endlessly. The usual quip that "the internet is slow" or some such came quickly; in this case, the complaint was that the network to and from the primary VM for some Ansys tools was very slow. Looking inside this 16-core, 64GB VM running Linux, it was not entirely apparent what was going wrong. The load was high, and in the end, I could have done some basic thread counting and discovered the answer if I had known where to look. However, I was able to use Dtrace in an opaque way and see from the hardware node that the qemu/kvm process was using its 16 CPU threads full tilt, and that it was not I/O bound. Looking back into the Linux VM, it took little time to find that it was oversubscribed beyond its 16 CPU cores, trying to run at least 17 full-time tasks. One task had over 20GB of memory mapped, and each time it yielded execution time and reacquired its CPU, it would undoubtedly spend extra effort in dealing with its large memory payload.
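The "basic thread counting" mentioned above could look something like the following inside the Linux guest. This is a rough sketch, not what was actually run at the time: the function names are made up for illustration, and it assumes a Linux /proc layout where the task state is the first field after the parenthesized command name in each stat file.

```python
import os


def state_from_stat(stat_line):
    """Return the one-letter task state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain spaces,
    # so split on the last ')' before reading the state field.
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_running(stat_lines):
    """Count tasks in state 'R' (running or runnable)."""
    return sum(1 for line in stat_lines if state_from_stat(line) == "R")


def read_task_stats(proc_root="/proc"):
    """Yield the stat line of every thread visible under proc_root (Linux only)."""
    for pid in filter(str.isdigit, os.listdir(proc_root)):
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            for tid in os.listdir(task_dir):
                with open(os.path.join(task_dir, tid, "stat")) as fh:
                    yield fh.read()
        except OSError:
            continue  # the process or thread exited while we were looking


def is_oversubscribed(running, cores):
    """True when more tasks want a CPU than there are cores to give them."""
    return running > cores
```

On a live system one would call is_oversubscribed(count_running(read_task_stats()), os.cpu_count()). Keep in mind that the script counts itself as a running task, and that a single sample is only a snapshot; a few samples in a row tell a more honest story.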

Killing off just one process here resolved the performance problem. VMs are sort of the worst-case scenario for Dtrace, but it still guided me to the solution. The trick with the book is that one needs real-world examples and enough practice to get the specific predicates and formation of queries down. Once you are over the hump, and I'm not sure I'm there yet, you'll find Dtrace to be indispensable.

Thursday, May 09, 2013

Decoding MPT_SAS drives in Nexenta/Illumos

Another group on campus had SAS resets on a drive, but the drive never failed. All we ever got was something like this in dmesg reports:

May 9 07:51:50 hostname scsi: [ID 365881 kern.info] /pci@0,0/pci8086,d138@3/pci1000,3010@0 (mpt_sas0):
May 9 07:51:50 hostname Log info 0x31080000 received for target 9.
May 9 07:51:50 hostname scsi_status=0x0, ioc_status=0x804b, scsi_state=0x0
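When a log is full of these, it helps to tally which target numbers keep showing up before going hunting for the physical drive. A small sketch, assuming the message format shown above; the regular expression and function name are my own and are not part of any Nexenta or LSI tool:

```python
import re
from collections import Counter

# Matches lines such as: "Log info 0x31080000 received for target 9."
TARGET_RE = re.compile(r"Log info (0x[0-9a-fA-F]+) received for target (\d+)")


def count_target_events(lines):
    """Return a Counter mapping target number to how many log-info events mention it."""
    counts = Counter()
    for line in lines:
        match = TARGET_RE.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts
```

This only tells you which target is noisy; it does nothing to map that target to a device name.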

We have a target address, but a zpool status or equivalent never gives us that target number, just a long string like "c0t50014EE057D81DE7d0". How to find the drive to off-line and replace it?

For future reference, you'll need to follow this link:

https://www.meteo.unican.es/trac/meteo/blog/SolarisSATADeviceName

You'll also need lsiutil, which is part of the LSIUtil Kit 1.63. For reference: One valid download link --- It's next to impossible to find this on LSI's site, and the Oracle references all seem to be down now.

Tuesday, May 07, 2013

NexentaStor auto-tier ACL workaround

Let us say that you created an auto-tier and said yes to preserve ACLs, but you used rsync as the protocol. Well, this is not realistic, and you'll see "Error from ACL callback (before): OperationFailed: the job was not completed successfully" in your logs after 0 seconds from running the auto-tier.

How to remove the ACL requirement? Delete and re-create the service? Not necessary.

If you run "show auto-tier :data-fsname-000" or the like, you'll see a value like "flags : copy ACLs" that is not otherwise addressable in properties. Just "setup auto-tier :data-fsname-000 property flags" and adjust the value "1024" to "0" and you'll remove the flag. That's it.

Friday, May 03, 2013

How To: Pluribus NAT Routing

It's no secret that at Stanford we do a lot with OpenFlow. We get to play with some new and interesting stuff that we integrate into our OpenFlow network. One of these is the Pluribus Network switch, which combines system and network virtualization with a high-bandwidth 48+ port 10Gb switch fabric. We have been running this in our network, and for months it has been handling the heavy lifting duties for our SmartOS-based private cloud.

Various features, including OpenFlow functionality, have been tested, but the product's user interface is still being crafted and changes somewhat over time. Recently, we needed to enable NAT routing for the private administrative network for the SmartOS private cloud. This network is not attached to a router interface, and applying something outside the network fabric to enable NAT or routing would create an undesired point of failure. Pluribus has full routing functionality tied to their virtual network capability. Here is the current command sequence used to enable routing between the private 10.0.x.0/16 administrative address space (could be larger) and an external routable network. I've added the VLAN to attach externally as VLAN 4444, and the fabric name is sdc-global:

> nat-create name sdc-global-gateway vnet sdc-global

> nat-interface-add nat-name sdc-global-gateway ip 10.0.27.1/24 if data

> nat-interface-add nat-name sdc-global-gateway ip 172.20.1.1/24 if data vlan 4444

> nat-map-add nat-name sdc-global-gateway name sdc-global-nat ext-interface sdc.global.gateway.eth0 network 172.20.1.0/24

sdc.global.gateway.eth0 should be the external port, as seen from "nat-interface-show"

UPDATE: A bug present when I first did this prevents the zone managing the NAT from having a correct default gateway. You'll need shell access and "zlogin sdc-global-gateway" or the like to enter the zone, and add /etc/defaultrouter with the IP of that router there for future use. Then you can exit the zone and run "zoneadm -z sdc-global-gateway reboot" to get it working.

Thursday, May 02, 2013

Grails Fixtures in Bootstrap.. the missing pieces

One nifty way to load a lot of data into either development or perhaps even production instances of Grails apps is the Fixtures Plugin. You can more easily define loads of data and multiple relationships with this plugin. This plugin is designed to be used for integration test data, but as noted here, nothing should prevent you from loading it in your Grails bootstrap step. However, I ran into curious errors such as "Fixture does not have bean". For my future self, here are the solutions to avoid incorrect assumptions:

In the BootStrap.groovy file, define the service fixtureLoader within the BootStrap class (def fixtureLoader).
All fixtures must be in a closure titled "fixture {}"
Each domain class needs importing at the top of each fixture file. This is the resolution to the bean error noted above.

Thursday, April 04, 2013

Joyent SDC 6.5.6 released -- Upgrade workaround

Just a heads up that Joyent has released Smart Data Center 6.5.6 as noted here:

http://wiki.joyent.com/wiki/display/sdc/Upgrading+SDC+6.5.3+or+6.5.4+to+SDC+6.5.6

First upgrade attempt fails at the very end when selecting the correct platform. Joyent Support noted that it has seen this before, and that a "sdc-restore" from the pre-upgrade backup and then a reattempt should work. In my case, it did just that. I did the quick restore (no -F here). Rebooting the head node as I write this.

Tuesday, April 02, 2013

Save time in backing up Joyent Smart Data Center

One does not frequently back up the head node USB key with Joyent's Smart Data Center. Generally you do it prior to upgrades. Therefore, it's commonly a "do it twice" process, as it has a quirky bug with regard to terminal emulation that I never seem to remember until it's too late:

[root@headnode (CIS:0) ~]# sdc-backup -U c2t0d0p0

Disk c2t0d0p0 will relabled, reformatted and all data will be lost [y/n] y

labeling disk

creating PCFS file system

mounting target disk c2t0d0

mounting source disk c0t0d0

copying files

setting up grub

Sorry, I don't know anything about your "xterm-256color" terminal.

Error: installing grub boot blocks

Yep, my OSX default xterm-256color is not known, and the many-minutes long backup process dies at the end. To address this, override the terminal setting in root user's .bash_profile file with the line:

export TERM=vt100

Simple, but it's not every day you can increase performance by 50%, so to speak.

Thursday, February 28, 2013

The fine art of SmartOS image creation in SDC

I suspect many organizations that run Joyent's Smart Data Center have it operated by Joyent staff themselves. Template creation of SmartOS images is something any private cloud operator will need to do, and Joyent has basic information on how to do so. However, certain steps require tools and code generally only available or known to Joyent staff. I wanted to impart my knowledge on how to go about doing this here for my own notes and for others.

First, one can follow the instructions at http://wiki.joyent.com/wiki/display/jpc2/Creating+Your+Own+SmartMachine+Image

I found that creating the snapshot locally to the compute node, as mentioned near the bottom, was insufficient, but your mileage may vary. I used the UI to snapshot the templated VM. In my case, I used the Smart64 image as my base OS image to then customize as mentioned, such as adding tomcat, services, and configurations.

One step that I found problematic is the meta data creation. The commands for doing this were found only in Smart64 or similar instances, and not the underlying nodes or SDC head node zones. I created a new Smart64 instance for template manipulation, pointed it to my cloudapi host using the "sdc-setup" command, and after configuration, used sdc-updatemachinemetadata from /opt/local/bin. The specific command I used for my meta data example was:

sdc-updatemachinemetadata -m image_name="tomcat" -m image_version="1.0.1" -m image_description="tomcat appserver" 99199472-bae6-4c89-a7ef-d6d4cf736feb

The final part of that line is the zone uuid after it has been shut down. The final step is to run sdc-create-image, a script that is only available internal to Joyent. Please contact your team rep to get this. Once you have it, your image publishing is a trivial command, run from the head node:

./sdc-create-image 99199472-bae6-4c89-a7ef-d6d4cf736feb

With that, your new template is created, and your users can now pick your new application image to instantiate.

Saturday, January 12, 2013

Joyent Anniversary: What's Next

It was just over a year ago I wrote about some of my initial thoughts regarding Joyent's Smart Data Center product and SmartOS in general. A lot has changed since then and the product has both matured and found more acceptance.

I never really got into the "Why" of using the product. We make decisions on product use entirely based on technical and strategic directions. Our use of virtual machine technology is not as common as what one would see among Amazon AWS or Joyent Public Cloud customers. Rather, the requirements have consisted of large, non-transient VMs used for simulations, CAD (of the chip variety) layout, and large data manipulation. We provide hardware for VMs that each require 16-64GB of RAM, dozens of gigabytes of local storage, and easily 12-16 cores apiece. Up to now, the solutions chosen in academia have been the likes of VMware ESX, Xen, etc. The problem is that the performance, stability, and data integrity requirements either push you toward the more expensive end of the product matrix of the above, or suggest that a novel scale-out cloud VM solution on premise would work out better. This second option can be found in either Joyent, which is ready to go now, or OpenStack and friends, which will some day achieve similar levels of maturity and flexibility.

Not everything is rosy in Joyent land though. Its primary focus up to now has been to match Amazon AWS in most customer requirements, inclusive of a multitude of transient VMs. This has left the 6.5 revision of the product wanting in areas such as conversion/migration of existing third-party VMs into the cloud instance, migrating VMs and settings/state between compute nodes, and even migration of head nodes between hardware. Over time, and with help from staff at Joyent, I've worked my way around these edge cases utilizing dataset and package templating, low-level use of ZFS snapshots and send/recv, and sometimes just plain old reinstallation of components.

With the announcement of Joyent 7, most of the above issues are being addressed, and we hope to both utilize the newer version and push for more change down the line to make this our go-to tool for virtualization of our entire environment. Where we best hope to help it is in the network space, as we have an obvious preference for the adoption of OpenFlow (SDN) to enable ease of multi-datacenter deployments.

2013 looks like a good year for our VM directions, and I expect others out there will see similar benefit if they just give this technology a try. I didn't give Joyent much thought until the KVM port. A year later, I'm glad we did.

Wednesday, January 04, 2012

Getting hands dirty with Joyent SDC: first lesson learned

Finally getting into Joyent's private cloud technology. I'll talk more about what all of this is useful for some other time, but this post is more of a note to self / note of warning. I repurposed some beefy ESX nodes for testing out Smart Data Center. But, those didn't have disks worth anything. Instead, I took some disks that were evacuated out of ZFS pools for larger drives. They would still be fine here...

The problem arises in setting up compute nodes, and later in any re-installing if necessary of the headnode. Things would quietly fail without any errors on compute node configuration, and re-installs of the head node dug a deeper hole. Turns out that Joyent is being ever too cautious in creating data pools for the head and compute nodes, and won't attempt to create the necessary local disk pools if the disks were previously associated with active ZFS pools. Silent errors are never good.

The workaround is to bring up the head node in their recovery mode, which is noted as not importing any pools. Next, associate the drives, import the pools (if fully there) or create a new pool for each individual disk, and then "zpool destroy" them. Rinse, repeat. I finally got my head node installed in a sane way, and now on to some remaining problems with compute nodes and testing out KVM and vcpu support. More on that later.

Monday, May 30, 2011

Web Frameworks / Models just aren't the same

Most of my days are spent hacking on web applications, with a strong requirement for database-backed solutions. I've been drinking the WebObjects Kool-Aid for quite some time, as there hasn't been a robust ORM (Object-Relational Mapping) solution that matches the maturity and the it-just-works of WebObjects' EOF layer.

However, the proverbial writing has been on the wall when it comes to Apple's continuing caretaking of the public version of this technology. A lot of technologies have arisen to make the continuing effort one needs to take using WebObjects questionable, and my mind simply can't quite get around the rule engine solution of Modern Direct2Web that helps modernize WebObjects to match. It's always a question of finding the best tool for the job, and part of the toolset is oneself. Am I sharp or honed enough to meet the new challenges I face? I've been both re-investing myself in WebObjects daily and checking out other frameworks. In almost all cases, I find again and again that they still don't match the now antiquated WO in getting things done right.

Then there is Grails. First, it's not Rails, which leaves a sour taste in my mouth. But it seems to take enough from both the Java/WO and Rails worlds, some of the best and some of the worst (Servlets, bleh!). I'm also stuck dealing with both Hibernate's deficiencies and Grails' band-aids above that. Burt Beckwith has provided multiple articles on the brain-dead dealings with collections and especially many-to-many relationships, requiring fetching of all entities to guarantee uniqueness in add and delete operations (it's more an issue with BelongsTo and hasMany, original example and here's indirect implementation details). Obviously, the object graph shows some immaturity. Grails 1.4 and the underlying updates, though, finally get me past my fears and concerns, and so a few projects are now being built on Grails, since I just can't get the quick build-out of applications above the model layer that I need in Modern D2W, and I require the dynamism of Groovy for certain specific requirements. Again, it's more about finding the tool that suits me best and not the limitations of the tools.

This brings us to the meat of my posting today. EOF and Wonder's templates have spoiled me in what model code (including the generation gap pattern) is provided for me and what I expect at the model layer. I've been trying to come to terms with both the features and lack thereof of model classes in Grails apps. Rereading the great book Grails In Action, I came to an important realization on what is missing here. Section 5.2 gets into the best practice of using Grails Services to encapsulate business logic per se and follow DRY principles. But, if one considers at least the MVC frameworks and where model logic goes, there seems to be a lot of the multi-domain logic (relationships) which never ends up in Grails domains and which one needs to best handle in Services. In the end, I've come to believe that a direct mapping of WebObjects EOF models is not to Grails domains, but to Grails services instead.

With all the time in the world, I think I'd want to spend time on a plugin or template enhancements to auto-generate more complete service definitions from "grails create-service", one which takes a domain and extends it for basic operations, but builds out basic relationship management methods in the service. This would also be an ideal place to be collections aware and turn into best practice some of Beckwith's ideas. If collections were always handled in the same manner in code, it would make the complicated implementation of the correct, performant way much more trivial.

Furthermore, akin to the generation gap pattern, domains would be less tinkered with other than defining what can and should go into the database directly. This is important for managing database migrations. Instead, any and all custom logic should persist in the service. Perhaps one day Domains will get all the correct relationship handling logic that EOF superclasses generally get, and the Service is then more akin to the custom-logic-only aspect that I've come to expect of EOF subclasses for my model objects. However, I feel my mind can work with this construct to be productive quickly in Grails instead of fighting against the grain or dirtying my controllers with model-specific mess.

For now though, I will endeavor to always use Services extensively, and make sure any generated scaffolding takes them into account more than Domains.

Thursday, May 06, 2010

NexentaStor issues?

Someone pointed out this "review" to me and asked if it was true. I ran into a similar issue. The user in that article was using the free 12TB edition without support, so perhaps that was why they didn't ask around per se or file a bug.

So, why is copying from a ZFS volume to another over rsync seemingly going on forever? I can't be sure this is the issue, but I had the same result, though in my case it was going from a NetApp to a ZFS data store using NexentaStor 3.0. The problem was that the source .snapshot tree was exposed, and likely in the case of the above reviewer, their .zfs tree was exposed. I've already mentioned to the Nexenta people that it's safer to exclude the terms ".snapshot | .zfs" by default for rsync service definitions, and let the end user override it. I too first thought it was the dedup going awry, but what I found on experimentation was that rsync was discovering those hidden paths and syncing each one. Dedup will only find duplicate blocks that line up, but the overall exposure to all those snapshots will come at some price.
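If you drive rsync yourself rather than through the appliance, the exclusion is easy to express. The sketch below is my own illustration, not NexentaStor's service definition: the helper names are made up, and the only rsync options it relies on are the standard -a and --exclude=PATTERN.

```python
from pathlib import PurePosixPath

# Directory names that expose snapshot trees on NetApp (.snapshot) and ZFS (.zfs).
SNAPSHOT_DIRS = (".snapshot", ".zfs")


def is_snapshot_path(path):
    """True if any component of the path is a snapshot directory."""
    return any(part in SNAPSHOT_DIRS for part in PurePosixPath(path).parts)


def rsync_command(source, dest):
    """Build an rsync argument list that skips snapshot trees at any depth."""
    cmd = ["rsync", "-a"]
    for name in SNAPSHOT_DIRS:
        cmd.append(f"--exclude={name}")
    cmd += [source, dest]
    return cmd
```

The argument list can be handed to subprocess.run as is; is_snapshot_path is handy for checking a file list after the fact to see whether a previous run wandered into a snapshot tree.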

If you are pulling data from one snapshot-based file system to another, it is always best to do so relative to the most recent snapshot, as you are assured the data isn't changing during the synchronization, and you'll avoid falling down the snapshot well.

Wednesday, March 17, 2010

WD Caviar Green drives and ZFS (UPDATED)

We are in the process of outfitting a new primary storage system, and I was of the mind to buy more WD Caviar Green drives, specifically more of the 1.5TB WDEADS drives, as we had 4 new ones already that were tested behind a slower RAID card. Before buying more, I searched the usual suspects for pricing, and found the 1TB to 2TB versions of this drive are all priced very well, even for 5400RPM drives, but different sites and/or comments now note that they should not be used in RAID configurations. Hmm.

I did a little more research and saw this blog post depicting how one should avoid directly integrating these drives with ZFS. I got a couple, so I decided to put them in my server with an LSI-3442E SAS backplane and tested them. First, I tested my 500GB drives in a mirror set, and doing a "ptime dd if=/dev/zero of=test1G bs=4k count=250000" on the ZFS volume made up of those drives, I transferred 1GB in 3.63 seconds, or 282MB/sec. I then immediately tried the same on my mirror set of the WD drives, benefitting from caching of the first write. After 50+ minutes of waiting, I killed the write and saw that I transferred only 426MB, at a rate of 136KB/sec.
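For anyone checking the math, the numbers above work out as follows. This is just arithmetic on the figures quoted in the paragraph; it assumes decimal units (1MB = 1,000,000 bytes, 1KB = 1,000 bytes) for the reported rates, and the variable names are mine.

```python
# dd if=/dev/zero of=test1G bs=4k count=250000
block_size = 4 * 1024                 # bs=4k, in bytes
count = 250_000
total_bytes = block_size * count      # 1,024,000,000 bytes written

# Working 500GB mirror: finished in 3.63 seconds.
fast_mb_per_s = total_bytes / 3.63 / 1e6

# WD mirror: 426MB moved at the reported 136KB/sec before the write was killed.
slow_seconds = 426e6 / 136e3
slow_minutes = slow_seconds / 60

# How many times faster the working mirror was.
slowdown = (fast_mb_per_s * 1e6) / 136e3

print(f"fast mirror: {fast_mb_per_s:.0f} MB/sec")
print(f"slow mirror ran for about {slow_minutes:.0f} minutes")
print(f"slowdown factor: about {slowdown:.0f}x")
```

The elapsed time implied by 426MB at 136KB/sec comes out a little over 52 minutes, which agrees with the "50+ minutes" of waiting, and the gap between the two mirrors is roughly three orders of magnitude.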

Yes, I can confirm that these drives are less than useless in a ZFS system (see update below), even as a simple two-disk mirror set. Some basic iostat showed way too much "asvc_t" service time on the disks, running from 3.5 secs to 10 secs per write, whereas the service times for the working 500GB drives were 0.7msec or the like. I had various mpt_handle_event_sync errors in my kernel logs, so perhaps there is some specific pathology between the SAS HBA, the SAS/SATA backplane, and these disks. However, we've proven this box works well with various drives. I'm going to try yet another 1.5TB drive, likely the previously maligned Seagate drives, since I've yet to have trouble with the latest firmware on those. My 4 WD drives will be placed in enclosures for external Time Machine backups in the near future. WD Caviar Green != Enterprise RAID drives.

UPDATE:

I'm leaving the above as is, but I think I have discovered perhaps a bad drive in the set, as when I employ 4 drives of this type I see odd I/O patterns but OK performance in a straight RAID 0. However, I regularly have at least one drive with higher average service times, and trailing I/O writes as it catches up to the other drives. With these 4 drives in a pool (RAID 0), I got 193MB/sec writes and 242MB/sec reads. Sticking them into a RAID10 (2 data, 2 mirror), I got a mere 78MB/sec writes and 278MB/sec reads.

Splitting them off into two separate RAID1 data pools, I ran my tests and still saw high service times on the drives (only 65 or so, much better than the above, but still slow). Per-mirror-set performance was dismal, as I regularly got 150MB/sec+ from a mirror of Caviar Black, but these drives just hit 31-34MB/sec (i.e., half of the above RAID10). I guess with enough drives I'll get to better numbers in RAID10. In a RAIDZ1 (RAID5) grouping, it was 60MB/sec on the writes, and 172MB/sec on the reads.

So what accounts for the dismal performance I originally saw? I think it has to do with when multiple pools are active, and they are not all of this drive type. My original test had a Hitachi drive set as well as a WD Caviar Green drive set. Although my tests ran one at a time, I'm guessing there was some bad timing/driver issues and/or hardware issues when dealing with the mixed HD media.

A second, updated conclusion is that you can use these drives, if only these drive types, in an array. RAID10 will get you sufficient performance, but otherwise you'll want to leave this to secondary storage. Future drive replacement scenarios are a real cause for concern.

Tuesday, March 02, 2010

ZFS Log Devices: A Review of the DDRdrive X1

My previous notes here have covered the trends to commodity storage, my happiness with most things ZFS and Nexenta, and how someday this will all make for a great primary storage story. At Stanford, we have a lot of disk-to-disk backup storage based on Nexenta solutions, using iSCSI or direct attached storage. We have also had some primary tier uses, but have had to play fast and loose with ZFS to get comparable performance. In essence, we sacrificed some of the ensured data integrity of ZFS to meet end users' expectations of what file servers provide.

A typical thing that was done was to set these values:

set zfs:zil_disable = 1

set zfs:zfs_nocacheflush = 1

These flags allowed a ZFS appliance to perform similarly to Linux or other systems when it came to NFS server performance. When you are writing a lot of large files, the ZFS Intent Log's additional latency doesn't affect NFS client performance. However, when these same clients expect their fsyncs to be honored on the back end with mixed file sizes that trend to a large volume of small writes, we start to see pathologically poor performance with the ZIL enabled. We can measure the performance at 400KB/sec in some of my basic synthetic tests. With the ZIL disabled, I generally got 3-5MB/sec or so, or 10x the performance. That's cheating and not so safe if the client thinks a write is complete but the backend server doesn't commit it before power loss or crash.

One ray of hope previously mentioned on this site was the Gigabyte i-RAM. This battery backed SATA-I solution held some promise, but at the time I used it I found a few difficulties. First, the state of the art at that time did not allow removal of log (ZIL-dedicated) devices from pools. One had to recreate a pool if the log device failed. That raised some problems with the i-RAM. First, I had it go offline twice requiring resetting the device, essentially blanking it out and requiring re-initializing it as a drive with ZFS. Second, the connection was SATA-I only, with it not playing well with certain SATA-II chipsets or mixed with SATA-II devices. Many users had to enable it in IDE mode versus the preferred AHCI mode.

Time has passed, and new solutions present themselves. First, log devices can be added or removed from a pool at any time, on the fly. Also new to the discussion is the DDRdrive X1 product. This mixed RAM and NAND device provides for a 4G drive image with extremely high IOPS and a solution to save to stable store (NAND SLC flash) if power is lost on the PCI bus. The device itself is connected to a PCI-Express bus, with drivers for OpenSolaris/Nexenta (among others) that make it visible as a SCSI device.

I tried different scenarios with this ZIL device, and all of them make it a sweet little device. I had mixed files that I pushed onto the appliance via NFS (Linux client) and found that I could multiply the number of clients and linearly increase performance. Where I would hit 450KB/sec without the ZIL device and not improve that rate by much with additional writers of data, using the ZIL log device immediately resulted in a good 7MB/sec of performance, with 4 concurrent write jobs yielding 27MB/sec. During this test, my X1 showed only a 20% busy rate using iostat. It would appear that I should get up to 135MB/sec at this rate (5x the concurrent writers), but my network connection was just gig-e, so getting anywhere near 120+MB/sec would be phenomenal. Another sample of mixed files with 5 concurrent writers pushed the non-X1 config to 1.5MB/sec, but in this case, the X1 took my performance numbers to 45-50MB/sec.
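The extrapolation above is simple enough to write down. This sketch only restates the figures from the paragraph; the linear-scaling assumption is the same one made in the text, not something measured, and the gig-e ceiling is the raw line rate with no protocol overhead taken out.

```python
measured_mb_per_s = 27.0     # 4 concurrent write jobs with the X1 as log device
busy_fraction = 0.20         # X1 busy rate reported by iostat

# If throughput scaled linearly until the log device hit 100% busy:
projected_mb_per_s = measured_mb_per_s / busy_fraction

# Raw gigabit Ethernet line rate in MB/sec (1,000,000,000 bits / 8), before overhead.
gige_mb_per_s = 1_000_000_000 / 8 / 1e6

# The network, not the log device, is the first limit reached.
ceiling_mb_per_s = min(projected_mb_per_s, gige_mb_per_s)

# Single-writer gain from adding the log device: 7MB/sec versus 450KB/sec.
single_writer_speedup = 7_000 / 450

print(f"projected at 100% busy: {projected_mb_per_s:.0f} MB/sec")
print(f"gig-e ceiling: {ceiling_mb_per_s:.0f} MB/sec")
print(f"single-writer speedup: {single_writer_speedup:.1f}x")
```

The projected 135MB/sec sits above what a single gigabit link can carry, which is why the network becomes the bottleneck well before the X1 does.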

So what is providing all this performance? As I mentioned above, the fsyncs on writes from the NFS client enforce synchronous transactions in ZFS when the ZIL is not disabled. My IOPS (I/O Operations per second) without an X1 log device were measured around 120 IOPS. With the dedicated RAM/NAND DDRdrive X1 solution, I easily approach 5000 IOPS. Those commits happen quickly, with the final stable store to your disk array laid out in your more typical 128K blocks per IOP. This dedicated ZIL device has been shown to do up to 200000 IOPS in synthetic benchmarks. Let's try the NFS case one more time, in a somewhat more practical test.

Commonly, in simulation, CAD applications, software development, or the like, you will be conversing with the file server committing hundreds to thousands of small file writes. To test this out and make it the worst-case scenario of disk block-sized files, I created a directory of 1000 512-byte files on the client's local disk. I did multiple runs to make sure this fit in memory so that we were measuring file server write performance. I then ran 400 concurrent jobs writing this to the file server into separate target directories. First, with the dedicated ZIL device enabled, I got 24MB/sec write rates averaging 6000 IOPS. I did spike up to 43K IOPS and 35MB/sec, likely when committing some of the metadata associated with all these files and directories. Still, the X1 was only averaging 20% busy during this test.
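If you want to reproduce something like this workload, the following is one way to do it. It is a sketch under assumptions: the original test did not necessarily use Python, the function names and file naming are mine, and the target directory is assumed to live on the NFS mount under test. Warm the source files into the client's cache with a throwaway run before timing anything.

```python
import os
import shutil
import time
from concurrent.futures import ThreadPoolExecutor


def make_source(src_dir, n_files=1000, size=512):
    """Create a directory of n_files files, each exactly `size` bytes."""
    os.makedirs(src_dir, exist_ok=True)
    payload = b"x" * size
    for i in range(n_files):
        with open(os.path.join(src_dir, f"f{i:04d}"), "wb") as fh:
            fh.write(payload)


def copy_job(src_dir, target_root, job_id):
    """Copy the whole source directory into its own target directory."""
    dest = os.path.join(target_root, f"job{job_id:03d}")
    shutil.copytree(src_dir, dest)
    return dest


def run_test(src_dir, target_root, jobs=400):
    """Run `jobs` concurrent copies and return the target directories and elapsed seconds."""
    os.makedirs(target_root, exist_ok=True)
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        dests = list(pool.map(lambda j: copy_job(src_dir, target_root, j), range(jobs)))
    elapsed = time.monotonic() - start
    return dests, elapsed
```

With the defaults this writes 400 x 1000 x 512 bytes, or about 205MB of file data, so dividing that by the elapsed time gives a rough payload rate. It will not match server-side numbers exactly, since metadata traffic is not counted.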

Next, I disabled the DDRdrive X1 and tried again, hitting the same old wall. This was the pathological case. With 400 concurrent writes I still just got 120 IOPS and 450KB/sec. My only thought at the time was "sad, very sad".

You can draw your own conclusions from this mostly not-too-scientific test. For me, I now know of an affordable device that has none of the drawbacks (4K block size, wear leveling) of SSD drives for use as a ZIL device. One can now put together a commodity storage solution with this and Nexenta, and have the same expected performance without compromise as one would expect from any first tier storage platform.

That leads me to the "one more thing" category. I decided to place some ESX NFS storage-pooled volumes on this box, and compare it to the performance of the NetApps we use to manage our ESX VMs (NFS). The file access modes of the VMs tend to be similar to mixed size file operations, but they do tend to be larger writes so the ZIL may not have as drastic of an effect. Anyway, I tried it without the X1 and I got 30-40MB/sec measured disk performance from operations within the VM (random tests, dd, etc). Enabling the ZIL device, I got 90-120MB/sec rates, so we still got a 3x improvement. I couldn't easily isolate all traffic away from my NetApps, but I averaged 65MB/sec on those tests.

Here, I think the conclusion I can draw is this: The dedicated ZIL device again improved performance up to matching what I theoretically can get from my network path. The comparison one can safely make with a NetApp is not that it's faster, as my test ran under different loads, but that it likely can match the line rates of your hardware and remove from the equation any concern for filesystem and disk array performance. Perhaps in a 10G network environment or with some link aggregation we can start to stress the DDRdrive X1, but for now it's obvious that it enables commodity storage solutions to meet typical NAS performance expectations.

Friday, November 27, 2009

ZFS Resilver quirks and how it lies

One of my ZFS-based storage appliances was running low on disk space, and since I made it a three-way stripe of mirrored disks, I could take the six 500GB drives and replace each with a 1.5TB drive in place, resulting in a major increase in capacity. A nifty ZFS software RAID feature versus typical hardware RAID setups. It's all good in theory, but resilvering (rebuilding an array pair) after replacing a drive takes quite some time. Even with only about 400GB to rebuild per drive, one sees the resilvering process cover 90% of the rebuild in 12 hours or so, but that last 10% takes another 10-12 hours. I think this has a lot to do with how snapshots or small files hurt ZFS performance, especially when you are close to a full disk. But it's all just a guess as to why it's slow on the tail end.

The resilver went as planned, replacing one drive after another serially, but taking care to only do one drive of a pair at a time. Near the end, I started to get greedy. With 98% done on one resilver, I detached a drive in another mirrored pair on the same volume, planning on at least placing the new drive into the chassis so I could start the final drive resilver remotely. To my surprise, the resilver restarted from scratch, so I had another 24 hours of delay to go. So, any ZFS drive removals will reset in progress scrubs/resilvers!

I then decided just to go ahead with the second resilver. This is where it got really strange. The two mirrored pairs started to resilver, and the speed was seemingly faster. After 12 hours, both pairs had about 400GB resilvered and the status of the volume indicated it was 100% done and was finishing. Hours later, it was still at 100%, but the resilver counter per drive kept climbing. Finally, after the more typical 24 hours or so, it noted it was completed.

  pool: data
 state: ONLINE
 scrub: resilver completed after 26h39m with 0 errors on Tue Nov 24 22:33:46 2009
config:

 NAME        STATE     READ WRITE CKSUM
 data        ONLINE       0     0     0
   mirror    ONLINE       0     0     0
     c2t1d0  ONLINE       0     0     0
     c2t0d0  ONLINE       0     0     0
   mirror    ONLINE       0     0     0
     c2t3d0  ONLINE       0     0     0
     c2t2d0  ONLINE       0     0     0  783G resilvered
   mirror    ONLINE       0     0     0
     c2t5d0  ONLINE       0     0     0
     c2t4d0  ONLINE       0     0     0  781G resilvered

Yes, it looks like at least with this B104+ kernel in NexentaStor, the resilver counters lie. When you have two ongoing resilvers, each counter is nominally the total data resilvered across the whole pool. You'll thus need to wait for double the expected data amount before it completes. Thus, it's very important not to reset the system until 100% turns into a "resilver completed..." statement in the status report.
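If you script around this, the safe check is the scrub line and not the percentage or the per-drive counters. Below is a minimal sketch in Python that parses status text in the format shown above. The function names are my own, and it only understands the "scrub:" layout of this 2009-era build, so treat it as a starting point and check it against what your own system prints.

```python
import re

UNIT_BYTES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

# Matches a device row that ends with a "<size> resilvered" note.
DEVICE_ROW = re.compile(
    r"^\s*(\S+)\s+\S+\s+\d+\s+\d+\s+\d+\s+([\d.]+)([KMGT])\s+resilvered\s*$"
)


def resilver_completed(status_text):
    """True only if the scrub line says 'resilver completed'."""
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("scrub:"):
            return "resilver completed" in stripped
    return False


def resilvered_bytes(status_text):
    """Map device name to the resilvered byte count reported on its row."""
    totals = {}
    for line in status_text.splitlines():
        match = DEVICE_ROW.match(line)
        if match:
            name, number, unit = match.groups()
            totals[name] = int(float(number) * UNIT_BYTES[unit])
    return totals
```

Fed the status report above, resilver_completed() returns True and resilvered_bytes() reports counters for c2t2d0 and c2t4d0 only. Given the double counting, those per-drive numbers are best used as a rough progress hint and never as the signal that it is safe to reboot.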

Tuesday, August 18, 2009

Prepping for Snow Leopard Server and a lesson on backups

We all know that MacOSX 10.6 Server is coming out RSN. All of us who use OpenDirectory are starting to wonder about the pain we will soon endure when upgrading. Here are a few hints to keep in mind.

- Time Machine backups do not by default restore a good MacOSX Server image. Read all about it here and learn now what will go wrong. Namely, edit the mentioned StdExclusions.plist file to remove /var/log and /var/spool from the exclusion list, and consider recreating your backups from scratch.

- If you have ADC membership or otherwise can purchase WWDC 09 videos, acquire Session 622, Moving to Snow Leopard Server. Lots of good stuff there, but I'll suggest a less than perfect but simpler upgrade path.

- To upgrade, use Carbon Copy Cloner or the like to make a full bootable system copy on an external drive -- likely your Time Machine disk. At this point, you can also re-enable Time Machine to use the rest of the disk for backups using the corrected excludes list. Obviously, this disk should be far larger in size than what you have used on your OSX Server.

- You might be upgrading to a beefier 64-bit Intel configuration for your OpenDirectory master or just upgrading in place on the old hardware. I recommend using this approach on new hardware. Take that clone disk and boot off of it on the new box, and then clone yet again to the local disk or array. Now you can do an in-place upgrade to 10.6 on non-production hardware, test, etc. Your previous master is now your first replica when you go production. If you upgrade in place, you should first test that the clone disk boots and works as your primary, but now you do have a full production-worthy backup disk.

- Once you are past a certain point in time, I'd remove the backupdbs on that external disk (don't erase it) and reuse it for Time Machine again. You now have a way to revert to 10.5 pre-upgrade or revert to any 10.6 point in time. You should check the exclusions file before commencing Time Machine backups to make sure you are getting the expected full server backup.

- Profit

Saturday, August 02, 2008

Amanda: simple ZFS backup or S3

When I first started researching ZFS, I found it somewhat troubling that no native backup solution existed. Of course there were the ZFS send/recv commands, but those didn't necessarily work well with existing backup technologies. At the same time, the venerable open source backup solution, amanda, had found a way to move beyond its limitation of maximum tape size restricting backup run size. Over time, we have found ways to marry these two solutions.

In my multi-tier use of ZFS for backup, I always need an n-tier component that will allow for permanent archiving to tape every 6 months or year, as deemed fit for the data being backed up. These are full backups only, and due to the large amounts of data in the second tier pool, a backup to tape may span dozens of tapes and run multiple days. I found I had to tweak amanda's typical configuration to allow for very long estimate times, as the correct approach to backing up a ZFS filesystem today involves tar, and amanda's approach does a full tar estimate of a backup before a real backup is attempted. Otherwise, all you need is a sufficiently large tape library and a working amanda client configuration on your ZFS-enabled system.

For those following along, I'm an avid user of NexentaStor for my second tier storage solution. Setup of an amanda client on that software appliance is actually quite easy.


setup network service amanda-client edit-settings
setup network service amanda-client conf-check
setup network service amanda-client enable

That's all that one needs to do. There is a sample line in the amanda configuration that you adjust in the first command above. The line I used is similar to this:


amandasrv.stanford.edu amanda amdump

You'll find that, depending on your build of amanda server, you'll have either the legacy user name of "amanda", the zmanda default of "amanda_backup", or the Redhat default of "backup" as the user things run as. I guess there had to be a user naming conflict at some point with "amanda".

The hardest part of the configuration is finding where you have your long term snapshots. Since a backup run can take days to weeks, you'll likely wish to back up volumes relative to a monthly snapshot. In your amanda /etc/amanda/CONFDIR/disklist configuration, a sample you may have for a ZFS-based client named nexenta-nas with volume tier2/dir* is:


nexenta-nas /volumes/tier2/dir1/.zfs/snapshot/snap-monthly-1-latest  user-tar-span
nexenta-nas /volumes/tier2/dir2/.zfs/snapshot/snap-monthly-1-latest  user-tar-span

Note well the use of user-tar-span in the two lines above. This allows for backing up large volumes over multiple tapes in amanda. That one limitation of tape spanning in amanda was solved in a novel way. They break up backup streams into chunks of a set size, so that a write failure at the end of one tape can begin fresh again at the beginning of that chunk on the following tape. This feature also allows amanda to be used to back up to Amazon's S3 service. Yes, instead of going to tape, you can configure a tape server to write to an S3 service. S3 limits writes to a maximum of 2GB a file, and amanda's virtual tape solution combined with that chunk sizing of backups works wonderfully to mate ZFS-based storage solutions to S3 for an n-tier solution. Please consult Zmanda's howto for configuring your server correctly. There really is nothing left to configure to get ZFS data to S3.
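To make the chunking idea concrete, here is a toy model in Python of how fixed-size chunks let one backup stream span tapes. This is not amanda's code. The function name and the unitless sizes are made up for illustration, and a chunk that does not fit in the space left on a tape is simply treated as the failed write that gets redone from its beginning on the next tape.

```python
def plan_tape_writes(stream_size, chunk_size, tape_capacity):
    """Assign fixed-size chunks of one backup stream to tapes.

    Returns a list of tapes, each a list of (chunk_index, chunk_bytes).
    A chunk that does not fit in the remaining space of the current tape
    is restarted at the beginning of a fresh tape.
    """
    if chunk_size > tape_capacity:
        raise ValueError("chunk_size must not exceed tape_capacity")
    tapes = [[]]
    free = tape_capacity
    offset = 0
    index = 0
    while offset < stream_size:
        # The last chunk may be shorter than chunk_size.
        size = min(chunk_size, stream_size - offset)
        if size > free:
            # Simulated end-of-tape failure: redo this chunk on a new tape.
            tapes.append([])
            free = tape_capacity
        tapes[-1].append((index, size))
        free -= size
        offset += size
        index += 1
    return tapes
```

For example, a stream of 10 units with a chunk size of 3 and a tape capacity of 7 puts chunks 0 and 1 on the first tape and chunks 2 and 3 on the second, leaving one unit unused at the end of the first tape. That wasted tail is the price of spanning, which is why a smaller chunk size wastes less tape.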

Sunday, July 27, 2008

Pogo Linux, Nexenta announce StorageDirector Z-Series storage

Pogo Linux Inc., a Seattle-based storage server manufacturer, and Nexenta Systems Inc., developer of NexentaStor, an open storage solution based upon the revolutionary file system ZFS, announced Wednesday immediate availability of a new set of storage appliances featuring NexentaStor.

Yeah.. that was the posted text above. What does it really mean? More kit choices to get an open storage NAS. Some nice configuration options when ordering, but I didn't see an easy way to request smaller system disks versus the rest of the data drives for any given Z-Series unit. It's a very good first step. If a Linux vendor adopts an appliance based on OpenSolaris (albeit a Debian/Ubuntu-lookalike), you know there is something cooking.


Monday, June 16, 2008

Closet WO Developer

One of the many hats I wear is that of an erstwhile Java developer. Our internal apps have been heavily reliant on object relational mappers. I've dabbled in RoR, and even helped get a TurboGears project off the ground here that was fully open source. However, the primary solution we've used in production since 2002 has been the granddaddy of ORM solutions: WebObjects.

This past week at Apple's WWDC was a great one for WebObjects. The usual NDA applies. However, prior to that, the WebObjects community had their own two-day in-depth conference in San Francisco. No NDA for that, and I can report that WO development is alive, well, and dare I say thriving? The news about SproutCore has a second story, in that the backend of choice may be RoR, but the #1 apps will likely also be WO-based. Got an iPhone? Learn WO. As we open source some of our projects here, I'll write a few more posts and speak on some more points, but with the latest release of WebObjects (5.4.x) the final deployment restrictions on the free WO frameworks were lifted. I expect some level of renewed interest.

To find out more, check in with the WOCommunity.