Tuesday, May 27, 2008

The problem with slogs (How I lost everything!)...

A while back, I spoke of the virtues of using a slog device with ZFS. The system I went into production with had an Nvidia-based SATA controller onboard and a Gigabyte i-RAM card. No problems there, but at the time it was a cmdk driver (PATA mode) for my OpenSolaris-based NexentaStor NAS. After a while, I got an error where the i-RAM "reset" and the log went degraded. The system simple started to use the data disks for the intent log. So, no harm done. Its important to note that the kernel was a B70 OpenSolaris build.

Later, I wanted to upgrade to NexentaStor 1.0, which had B85. Post upgrade or using a boot CD, it would never come up with the i-RAM attached. The newer kernel was an nv_sata driver, and I could always get it to work in B70, so I reverted to that. This is one nice feature that Nexenta has had for quite some time, in that the whole OS is checkpointed using ZFS to allow reversions if an upgrade doesn't take. Well, the NAS doesn't like having a degraded volume, so I've been trying to "fix" the log device. Currently, in ZFS, log devices cannot be removed, but only replaced. So, I tried to replace it using the "zpool" command. Replacing the failed log with itself always fails as its "currently in use by another zpool". I figured out a way around that, and that was to fully clear the log using something like "dd if=/dev/zero of=/dev/mylogdevice bs=64k". I was able to upgrade my system to B85, and then I attempted to replace the drive again, and it looked like it was working:

pool: data
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress for 0h0m, 0.00% done, 450151253h54m to go

data DEGRADED 0 0 0
raidz1 ONLINE 0 0 0
c7t0d0 ONLINE 0 0 0
c7t1d0 ONLINE 0 0 0
c6t1d0 ONLINE 0 0 0
logs DEGRADED 0 0 0
replacing DEGRADED 0 0 0
c5d0 UNAVAIL 0 0 0 cannot open
c8t1d0 ONLINE 0 0 0

errors: No known data errors

Note well, that it is replacing one log device with another (using the new nv_sata naming). However, after it reached 1% it would always restart the resilver with no ZFS activity, no snapshots, etc. The system was busy resilvering and resetting, getting no where. I decided to reboot to B70, and as soon as that came up, it started to resilver immediately and it proceeded after quite a long time for a 2GB drive to complete the resilver. So, everything was now fine, right?

This is where things really went wrong. At the end of the resilver, it still considered the volume degraded, and looked like the above output but with only one log device. Rebooting the system, the volume started spewing out ZFS errors, and checksums counters went flying. My pool went offline. Another reboot, this time with the log device disconnected due to nv_sata not wanting it connected for booting purposes causes immediate kernel panics. What the hell was going on? Using the boot cd, I tried to import the volume. It told me that the volume had insufficient devices. A log device shouldn't be necessary for operation, as it hadn't needed it before. I attached the log device and ran cfgadm to configure it, which works and gets around the boot time nv_sata/i-RAM issue. Now it told me that I have sufficient devices, but what happened next was worse. The output showed that my volume consisted of one RAIDZ, an empty log device definition, and additionally my i-RAM as an additional degraded drive added to the array as a stripe! No ZFS command was run here. It was simply the state of the system relative to what the previous resilver had accomplished.

Any attempt to import the volume fails with a ZFS error regarding its inability to "iterate all the filesystems" or something to that affect. I was able to mount various ZFS volumes read-only by using the "zfs mount -o ro data/proj" or similar. I then brought up my network and manually had to transfer the files off to recover, but this pool is now dead to the world.

What lessons have I learned? Slog devices in ZFS, though a great feature, should not be used in production until they can be evacuated. There may be errors in the actions I took above, but bugs that I see include the inability for the nv_sata driver to deal with the i-RAM device for some odd reason, at least in B82 and B85 (as I've so far tested). The other bug is that a log replace appears to either not resilver at all (B85) or, when resilvering in older releases, causes the system to not correctly resilver the log but instead to shim the slog in as a data stripe. I simply can't see how that is by any stretch of the imagination by design.


jmlittle@gmail.com (Joe Little) said...

Eric Schrock noted over in the zfs-discuss list that " The basic problem is that log devices are kept within the normal vdev tree, and are only distinguished by a bit indicating that they are log devices (and is the source for a number of other inconsistencies that Pwel has encountered).

When doing a replacement, the userland code is responsible for creating the vdev configuration to use for the newly attached vdev. In this case, it doesn't preserve the 'is_log' bit correctly. This should be enforced in the kernel - it doesn't make sense to replace a log device with a non-log device, ever."

So, I'm not completely nuts :)

Kris said...

Just partially.. :)