Friday, November 27, 2009

ZFS Resilver quirks and how it lies

One of my ZFS-based storage appliance was running low on disk space, and since I made it a three way stripe of mirrored disks, I could take the 6 500GB drives and replace them with 1.5TB drives each in place, with the result a major increase in capacity. Nifty ZFS software RAID feature versus typical hardware RAID setups. Its all good in theory, but resilvering (rebuilding an array pair) after replacing a drive takes quite some time. Even with only about 400GB to rebuild per drive, one sees the resilvering process cover 90% of the rebuild in 12 hours or so, but that last 10% takes another 10-12 hours. I think this has a lot to do with how snapshots or small files hurt ZFS performance, especially when you are close to a full disk. But its all just as guest as to why its slow on the tail end.

The resilver went as planned, replacing one drive after another serially, but taking care to only do one drive of a pair at a time. Near the end, I started to get greedy. With 98% done on one resilver, I detached a drive in another mirrored pair on the same volume, planning on at least placing the new drive into the chassis so I could start the final drive resilver remotely. To my surprise, the resilver restarted from scratch, so I had another 24 hours of delay to go. So, any ZFS drive removals will reset in progress scrubs/resilvers!

I then decided just to go ahead with the second resilver. This is where it got really strange. The two mirrored pairs started to resilver, and the speed was seemingly faster. After 12 hours, both pairs had about 400GB resilvered and the status of the volume indicated it was 100% done and was finishing. Hours later, it was still at 100%, but the resilver counter per drive kept climbing. Finally, after the more typical 24 hours or so, it noted it was completed.

  pool: data
state: ONLINE
scrub: resilver completed after 26h39m with 0 errors on Tue Nov 24 22:33:46 2009

data ONLINE 0 0 0
mirror ONLINE 0 0 0
c2t1d0 ONLINE 0 0 0
c2t0d0 ONLINE 0 0 0
mirror ONLINE 0 0 0
c2t3d0 ONLINE 0 0 0
c2t2d0 ONLINE 0 0 0 783G resilvered
mirror ONLINE 0 0 0
c2t5d0 ONLINE 0 0 0
c2t4d0 ONLINE 0 0 0 781G resilvered

Yes, it looks like at least with this B104+ kernel in NexentaStor, the resilver counters lie. When you have two ongoing resilvers, each counter is nominally the total data resilvered across the whole pool. You'll thus need to wait for double the expected data amount before it completes. Thus, its very important to not reset the system until 100% turns into a "resilver completed..." statement in the status report.