A stressful upgrade – and a design for a better backup system

My server cluster has been having lots of outages lately. After much experimentation, I traced them down to what is probably the Ethernet driver in NetBSD 3.1, which I was running: the machine in question (as luck would have it, the NFS/NIS server; mental note: make the other server a NIS slave so it can run on its own...) would just disappear from the network, yet be perfectly happy when spoken to over a serial console - but ifconfig wm0 down ; ifconfig wm0 up would then hang it.

So, since a machine with the same Ethernet interface, but running NetBSD 4.0, was working fine, and I could see there had been a lot of commits to the driver between those versions, I decided to upgrade it.

Easy enough, right? Stick in the NetBSD 4 boot CD, boot, select Upgrade. But then problems struck: my /usr and /var are on a RAIDframe mirror set, and the install kernel didn't have RAIDframe compiled in. So I just let the installer put /usr and /var into plain directories under the root filesystem, then booted into the new system, mounted the RAID, and copied the new /usr over my old one - upgrading all the binaries while leaving /usr/pkg and /usr/local untouched.
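
The recovery steps went roughly like this (a sketch from memory: the raid0 device, the 'e' partition and the /mnt mount point are illustrative, not necessarily what's on my machine):

    raidctl -c /etc/raid0.conf raid0   # bring up the mirror (if it isn't autoconfigured)
    fsck -y /dev/rraid0e               # check it before mounting
    mount /dev/raid0e /mnt             # mount the RAID filesystem holding the real /usr
    # Overlay the freshly-installed /usr onto the old one. The new sets contain
    # no /usr/pkg or /usr/local, so those directories are left untouched.
    cd /usr && pax -rw -pe . /mnt/usr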

Much hilarity then ensued: a bunch of core shared libraries live in /lib, so that a system without /usr mounted can still run, but they're symlinked into /usr/lib for normal use. My /usr, however, is a symlink to /data0/usr, so the ../../lib/libc.so.12 symlink targets weren't finding /lib. Easily enough fixed once I'd realised what was going on, but it took a while.
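
To make the failure mode concrete (paths as described above; the resolution trace is the interesting bit):

    ls -ld /usr                   # /usr -> /data0/usr
    ls -l /usr/lib/libc.so.12     # libc.so.12 -> ../../lib/libc.so.12
    # The kernel resolves that relative target against the real directory:
    #   /data0/usr/lib/../../lib/libc.so.12  =>  /data0/lib/libc.so.12
    # ...which doesn't exist: the libraries actually live in /lib.
    # One possible fix (not necessarily the one I used): a compatibility symlink
    ln -s /lib /data0/lib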

So I set the system to work compiling a custom kernel to replace the GENERIC one I was running, and recompiling the contents of /usr/pkg (my installed packages), which had all been built for the old kernel and so were running under compatibility emulation. And left it.
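
The kernel build itself is the standard NetBSD procedure, roughly (MYKERNEL is a placeholder name, and I'm assuming an i386 machine here):

    cd /usr/src/sys/arch/i386/conf
    cp GENERIC MYKERNEL        # start from GENERIC and trim or tune options
    vi MYKERNEL
    config MYKERNEL            # generate the build directory
    cd ../compile/MYKERNEL
    make depend && make
    mv /netbsd /netbsd.old     # keep the old kernel as a fallback
    cp netbsd /netbsd          # install the new one, then reboot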

Only to find, when I checked back later, that it had died and was sitting at a boot prompt, asking whether to go into single-user mode because the filesystem checks had failed. I'd timed my upgrade to coincide with a scheduled power outage that was happening that day anyway, for some recabling of the power; but when I found the machine failing to boot, I took a peek and noticed that the power cables we'd plugged in had been rather neatened up - and, I suspect, in so doing, wiggled just enough to reboot the machine... while it was compiling heavily, hence lots of filesystem corruption.

Fair enough: a manual fsck and we were up and compiling again. But when I checked again, the kernel had panicked in the filesystem code. So I rebooted, sat through another filesystem repair, and tried again...

...and another panic, before long. And another filesystem repair. Hmmm.

So I did an experiment: I ran a filesystem repair, confirmed all the fixes it proposed, then immediately ran another one. And the second one found more errors - which isn't supposed to happen.
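
The experiment is easy to reproduce (raid0a standing in for whichever partition you're checking):

    fsck -fy /dev/rraid0a   # force a full check, answering yes to every fix
    fsck -fy /dev/rraid0a   # run it again straight away: this second pass
                            # should find nothing - mine found more errors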

It occurred to me to check the RAID status. "Parity rebuild: 23% complete", it said, or words to that effect.
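
That is, from raidctl's status query (output paraphrased from memory):

    raidctl -s raid0
    # ...amongst the component listing, a line to the effect of:
    #   Parity Re-write is 23% complete.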

A theory formed. My disks are software RAID-1, meaning the two disks are identical copies: every write goes to both disks, so that if one disk fails there's another copy available, and reads can come from either disk, which increases read speed. But if the system is powered off unceremoniously while writes are in flight, the two copies can differ, because the system was partway through updating them.
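
For reference, a two-disk RAIDframe mirror is described by a config file along these lines (disk names illustrative; the final 1 on the layout line selects RAID level 1):

    # /etc/raid0.conf - a RAID-1 pair
    START array
    # numRow numCol numSpare
    1 2 0

    START disks
    /dev/wd0e
    /dev/wd1e

    START layout
    # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
    128 1 1 1

    START queue
    fifo 100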

"Rebuilding parity" doesn't quite make sense in a RAID-1 system rather than RAID-5, but in a situation where the RAID system had been shut down unceremoniously, I would expect the system to scan the two mirrors and, wherever there was a difference, choose one version or the other (unless either had a bad checksum to discredit it) and copy that to the other. Which is presumably what it was doing.

I've not looked into the sources yet, and have only skimmed the manual, but I suspect that RAIDframe (the software RAID system) was still servicing read requests from both disks while it was checking the mirroring, rather than lazily checking the parity of each block as it was read - so the same block could return different contents depending on which disk answered. Either that, or a bug in fsck was making it fail to fix the filesystem correctly; I don't know.

Either way, I kept the system in single-user mode until the parity was all checked, then did a complete, thorough filesystem check on all the disks, before going multi-user and starting the compile again.

This time it worked fine.

But the good news is, I took a full backup before doing the upgrade. Not that I needed it - which was lucky, since I used duplicity to back up to Amazon S3, and the full backup alone took several days; I'd hate to have to do a full restore.
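
For reference, the duplicity-to-S3 invocation looks roughly like this (bucket name and paths hypothetical; the keys and GnuPG passphrase are passed through the environment):

    export AWS_ACCESS_KEY_ID="..."
    export AWS_SECRET_ACCESS_KEY="..."
    export PASSPHRASE="..."    # used to encrypt the backup volumes
    duplicity full /data0 s3+http://my-backup-bucket      # the multi-day full backup
    duplicity restore s3+http://my-backup-bucket /data0   # the restore I'd hate to need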


3 Comments

  • By Improbulus, Mon 15th Dec 2008 @ 4:44 pm

    Whew! I'd have thought you'd have wanted a break at Xmas, but I guess building a backup system is as good as a break to you, innit? 😛 🙂 Good luck - but do make sure you have some kind of a rest, at least! Unlike those of us who'll have to swot..

  • By Faré, Mon 22nd Dec 2008 @ 12:31 am

    Congratulations on salvaging the system.

    Speaking of backups, are you satisfied with duplicity? What else did you try? What made you choose it over the competition? Would you recommend it, or try something else?

  • By Faré, Mon 22nd Dec 2008 @ 2:19 am

    Oops, hadn't read page 2.

    Yes, I'd like such a system. But why limit it to backups? If you standardize on a small-enough bucket size (say, 64KB to a few MB) and have a general content-addressed mechanism (say, using Tiger Tree-Hash - SHA-1 is broken!), then your protocol is also perfect as a storage backend for file-sharing (Gnutella also uses TTH).

    Do you have working code already?

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales