Friday, November 27, 2009

ZFS Resilver quirks and how it lies

One of my ZFS-based storage appliances was running low on disk space. Since the pool is a three-way stripe of mirrored disks, I could replace its six 500GB drives with 1.5TB drives in place, one at a time, for a major increase in capacity -- a nifty ZFS software RAID feature versus typical hardware RAID setups. It's all good in theory, but resilvering (rebuilding a mirror pair) after replacing a drive takes quite some time. Even with only about 400GB to rebuild per drive, the resilver covers 90% of the rebuild in 12 hours or so, but the last 10% takes another 10-12 hours. I suspect this has a lot to do with how snapshots and small files hurt ZFS performance, especially when the disks are nearly full, but it's all just a guess as to why it's slow on the tail end.
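For reference, the swap itself is just zpool replace, one drive at a time. Roughly (a sketch only -- the device names are placeholders for my controller's slot numbering, so yours will differ):

    # Replace one side of a mirror in place: pull the 500GB drive, seat the
    # 1.5TB drive in the same slot, then tell ZFS to rebuild onto it.
    zpool replace data c2t0d0

    # Watch the rebuild; don't touch the next drive until this one reports
    # "resilver completed".
    zpool status data

    # ...then repeat for c2t1d0, c2t2d0, and so on, never more than one
    # drive per mirror at a time.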

The resilver went as planned, replacing one drive after another serially, while taking care to do only one drive of a pair at a time. Near the end, I got greedy. With one resilver 98% done, I detached a drive in another mirrored pair on the same volume, planning to at least seat the new drive in the chassis so I could start the final drive resilver remotely. To my surprise, the resilver restarted from scratch, so I had another 24 hours of delay to go. So, any ZFS drive removal will reset in-progress scrubs and resilvers!
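The takeaway: check for an in-progress resilver before detaching anything else in the pool. A simple guard like this would have saved me a day (a sketch; the grep string matches the status wording on this build):

    # Only detach when nothing in the pool is still resilvering.
    if zpool status data | grep -q "resilver in progress"; then
        echo "resilver still running -- wait before detaching"
    else
        zpool detach data c2t2d0    # placeholder device name
    fi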

I then decided just to go ahead with the second resilver. This is where it got really strange. The two mirrored pairs started to resilver, and the speed was seemingly faster. After 12 hours, both pairs had about 400GB resilvered, and the status of the volume indicated it was 100% done and finishing. Hours later it was still at 100%, but the per-drive resilver counters kept climbing. Finally, after the more typical 24 hours or so, it reported completion.

  pool: data
 state: ONLINE
 scrub: resilver completed after 26h39m with 0 errors on Tue Nov 24 22:33:46 2009
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0  783G resilvered
          mirror    ONLINE       0     0     0
            c2t5d0  ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0  781G resilvered
Yes, it looks like at least with this B104+ kernel in NexentaStor, the resilver counters lie. When two resilvers are running, each drive's counter is nominally the total data resilvered across the whole pool, so you need to wait for roughly double the expected amount of data before it completes. It's therefore very important not to reset the system until the 100% turns into a "resilver completed..." statement in the status report.
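Given that, the safest approach is to key off the completion message rather than the percentage. A crude watch loop (a sketch, using the wording from the status output above):

    # Poll until the scrub line reports completion; only then is it safe
    # to reboot or pull the next drive.
    until zpool status data | grep -q "resilver completed"; do
        sleep 600
    done
    zpool status data | grep "scrub:"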

Tuesday, August 18, 2009

Prepping for Snow Leopard Server and a lesson on backups

We all know that MacOSX 10.6 Server is coming out RSN. All of us who use OpenDirectory are starting to wonder about the pain we will soon endure when upgrading. Here are a few hints to keep in mind.

- Time Machine backups do not, by default, restore a good MacOSX Server image. Read all about it here and learn now what will go wrong. Namely, edit the mentioned StdExclusions.plist file to remove /var/log and /var/spool from the exclusion list, and consider recreating your backups from scratch (see the sketch after this list for checking the exclusions).

- If you have an ADC membership or can otherwise purchase the WWDC 09 videos, acquire Session 622, Moving to Snow Leopard Server. Lots of good stuff there, but I'll suggest a less-than-perfect yet simpler upgrade path below.

- To upgrade, use Carbon Copy Cloner or the like to make a full bootable system copy on an external drive -- likely your Time Machine disk. At this point, you can also re-enable Time Machine to use the rest of the disk for backups, using the corrected exclusions list. Obviously, this disk should be far larger than what you have used on your OSX Server.

- You might be upgrading to a beefier 64-bit Intel configuration for your OpenDirectory master, or just upgrading in place on the old hardware. I recommend the new-hardware route. Take that clone disk, boot the new box off of it, and then clone yet again to the local disk or array. Now you can do an in-place upgrade to 10.6 on non-production hardware, test, etc. Your previous master becomes your first replica when you go production. If you upgrade in place, first test that the clone disk boots and works as your primary; either way, you now have a full, production-worthy backup disk.

- Once you're past a certain point in time, I'd remove the backupdbs on that external disk (don't erase it) and reuse it for Time Machine again. You now have a way to revert to 10.5 pre-upgrade or to any 10.6 point in time. Check the exclusions file before recommencing Time Machine backups to make sure you are getting the expected full server backup.

- Profit
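As for actually checking those Time Machine exclusions, here's a rough sketch. The StdExclusions.plist path is taken from my 10.5 box and is an assumption on my part -- verify it on your own system before editing anything:

    # Dump the standard exclusions and see whether backupd will skip the
    # server logs and mail/print spools (path assumed from 10.5 -- verify).
    EXCL=/System/Library/CoreServices/backupd.bundle/Contents/Resources/StdExclusions.plist
    plutil -convert xml1 -o - "$EXCL" | grep -E '/var/(log|spool)'

    # If either path shows up, keep a copy of the original, then remove
    # those entries so logs and spools land in future backups.
    sudo cp "$EXCL" "$EXCL.orig"
    sudo nano "$EXCL"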
