Saturday, May 03, 2008

ZFS: Is the ZIL always safe?

One of my ZFS-based appliances, used for long-term backup, was upgraded from B70 to B85 of OpenSolaris two weeks ago. This time around, I re-installed the system to get RAIDZ2, and certain "hacks" I had been using were no longer in place. The old settings lived in /etc/system: the well-known zil_disable and zfs_nocacheflush tunables. They had been left there from when the system temporarily acted as a primary server with its Adaptec (aac) SATA RAID card and its accompanying SATA-I drives. Since the unit was UPS-attached, it was relatively safe for NFS client access, and later on there was no direct NFS client access at all. No harm done, and the box stayed stable over multiple upgrades from B36 or so, running over a year without an error.
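
For reference, the two entries looked roughly like this in /etc/system on the old install (the comments are mine, and exact spellings may vary between builds, so treat this as a sketch):

* Disable the ZFS intent log (ZIL). This drops synchronous write
* guarantees, which is only tolerable here because the box is on a UPS
* and is not serving NFS clients directly.
set zfs:zil_disable=1

* Stop ZFS from issuing cache-flush commands to the drives. Only
* sensible when the controller/drive cache is battery or UPS backed.
set zfs:zfs_nocacheflush=1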

A curious thing happened as soon as I upgraded without these somewhat unsafe kernel settings. I started to get tons of errors, and twice my pool has gone completely offline until I cleared and scrubbed it. An example of the errors:

NAME        STATE     READ WRITE CKSUM
tier2       DEGRADED     0     0     0
  raidz2    DEGRADED     0     0     0
    c1t1d0  FAULTED      0    64     0  too many errors
    c1t2d0  DEGRADED     0    46     0  too many errors
    c1t3d0  DEGRADED     0    32     0  too many errors
    c1t4d0  DEGRADED     0    47     0  too many errors
    c1t5d0  DEGRADED     0    39     0  too many errors
    c1t6d0  FAULTED      0   118     0  too many errors
    c1t7d0  DEGRADED     0    57     0  too many errors

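Each time, getting the pool back was the usual clear-and-scrub routine, roughly this (on a multi-terabyte pool the scrub takes a good while):

# reset the error counters and the FAULTED/DEGRADED state
zpool clear tier2

# re-verify every block in the pool and watch the counters
zpool scrub tier2
zpool status -v tier2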

Nothing explained the turnaround from stable to useless for writes. I also got some read errors, and no nightly rsync against this tree would finish without incrementing some error count. Was one of my cache settings on the Adaptec card conflicting with a newer version of the "aac" driver? I thought I would need to isolate it, perhaps revert, or accept that the card was simply dying. Perhaps the cache/RAM on the card itself was toast.

A recent post on the opensolaris-discuss mailing list gave me an idea. Mike DeMarco suggested that a user suffering from repeated crashes, which corrupted ZFS until cleared, try zil_disable to test "if zfs write cache of many small files on large FS is causing the problems." That makes some sense if the card is somehow thrashing on small writes. Using the box for backup means it is read and written via rsync, which can involve many small updates, and I had various read errors pop up as well. So, after another degraded pool, I put old faithful zil_disable back, along with zfs_nocacheflush for good measure, and after a reboot and scrub let it run its nightly multi-terabyte delta rsyncs. After a few days, there are no errors. Have I stumbled onto some code-path bug that is ameliorated by these kernel options? Do newer kernels have suspect aac drivers?
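
To convince myself the pool was actually holding up, I have been checking it after each nightly run, along these lines (the script and paths below are just an illustration, not my actual cron job):

#!/bin/sh
# Illustrative nightly check: pull the deltas, then ask ZFS whether
# any pool has accumulated errors. Source/destination are made up.
SRC=primary::backups/
DST=/tier2/backups/

# pull the nightly deltas
rsync -a --delete "$SRC" "$DST"

# 'zpool status -x' prints "all pools are healthy" unless a pool
# has errors or is degraded/faulted
zpool status -x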

Perhaps someone will prove the logic of the above all wrong, but for now, I'm returning to the old standby "unsafe" kernel options to keep my pool stable.
