Backups and Archives

I'm always slightly frustrated in my attempts to create efficient backup and archival systems for my stuff, because the way filesystems are managed works against me.

The contents of my disks boil down into a few different categories, for backup purposes:

  1. Operating system and application sources and binaries, and other static stuff that can be fetched automatically. In particular, as a NetBSD and pkg_chk user, I can take a blank machine, install NetBSD from a CD-ROM, set up pkg_chk, drop a pkg_chk.conf file in, and run pkg_chk to have it fetch, compile, and install all my applications for me (sketched just after this list). This stuff does not need backing up.
  2. Temporary files. /tmp, cache directories, and various bits of /var on UNIX. Things that do store locally-created state, but stuff I can happily lose. This stuff does not need backing up.
  3. Version-control system checkouts. These are files that are copied from a server somewhere, but that I've been modifying. I ought to commit my changes back to the server after I've done anything. It's not worth backing up anything that's not been changed since it came from the server, and backups of my changes would be nice but aren't really necessary.
  4. Data that's not under version control, such as the version control repositories themselves on the servers, and things that won't neatly go into version control (databases, MacOS X bundles and other such directory trees that are aggressively managed by applications, like the iPhoto and iTunes folders, etc.).
  5. My archives. By which I mean my MP3 collection, photos, downloaded PDFs I particularly want to keep, etc. This is stuff that I don't want to lose, since it would be a pain or impossible to recreate, and it's typically rather bulky (running into gigabytes). However, the files are never changed once they arrive; they just slowly accumulate.
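
Roughly, the pkg_chk workflow from item 1 looks like this - a hedged sketch, since the flags are from memory and the package names are invented; check pkg_chk(8) for the details:

    pkg_chk -g        # on an existing machine: write a pkg_chk.conf listing what's installed
    # pkg_chk.conf then contains one pkgsrc package directory per line, e.g.:
    #   mail/mutt
    #   www/apache22
    #   editors/vim
    pkg_chk -a        # on the fresh install: fetch, build, and install everything listed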

Each of these categories really needs a different backup regime. Groups one and two are not worth backing up at all: group one can be recreated, given a network connection, from things like pkg_chk.conf (itself a tiny file that falls into group 4), while group 2 will usually be recreated automatically. Group three doesn't need backing up either, although some kind of lightweight real-time differential backup might be worth doing. Groups four and five need regular backing up, but given their widely differing update patterns, different backup systems might suit each.

In particular, group 5 (as my name for them suggests) is more suited to an archival system. You don't want to be doing regular full backups of many gigabytes of data that varies slowly - much easier to keep scanning for new files and appending them to two redundant archives. Venti would be ideal.
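
In the absence of a real Venti, even something this crude captures the append-only, two-copies idea - a minimal sketch, with hypothetical paths:

    #!/bin/sh
    # Copy anything new in the archive tree onto two independent archive disks.
    # --ignore-existing never touches a file the destination already has, which
    # suits write-once data like MP3s and photos.
    ARCHIVE=/home/archive                    # where the slowly-growing collection lives
    rsync -a --ignore-existing "$ARCHIVE"/ /mnt/archive-copy-1/
    rsync -a --ignore-existing "$ARCHIVE"/ /mnt/archive-copy-2/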

Group 4, meanwhile, needs traditional backup - periodic full dumps punctuated by more frequent incrementals if using batch-based backup media, or rsyncing to generational snapshot disks.
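
The generational-snapshot-disk half of that can be done with rsync's --link-dest - a sketch, with hypothetical paths; unchanged files are hard-linked against the previous generation, so only changes cost space:

    #!/bin/sh
    # Nightly generational snapshot of the group-4 data onto a backup disk.
    SRC=/home/
    DST=/backup/home
    TODAY=$(date +%Y-%m-%d)
    rsync -a --delete --link-dest="$DST/latest" "$SRC" "$DST/$TODAY/"
    rm -f "$DST/latest"
    ln -s "$TODAY" "$DST/latest"             # 'latest' always points at the newest generation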

The problem, though, is that files of all these categories are hard to differentiate in the file system. I have to manually specify detailed lists of path names to include in or exclude from my rsync backups, which requires specialist UNIX knowledge and is time-consuming and error-prone. If an application I install puts something in a place I'm excluding and I don't realise, then I risk data loss. If an application I install decides to dump a large cache inside a dotfile in my home directory, my backups suddenly start ballooning.
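
For illustration, the sort of hand-maintained list I mean looks something like this (an invented excerpt; the real thing varies per machine):

    # ~/backup-excludes - paths for rsync to skip, one pattern per line
    /tmp/
    /var/tmp/
    /var/cache/
    /usr/pkg/
    /usr/pkgsrc/
    .cache/
    Library/Caches/

    # and the backup run itself:
    rsync -a --exclude-from="$HOME/backup-excludes" / /backup/full/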

This sucks! But there's no easy solution without breaking backwards compatibility. Adding archival flags to files is fine, but existing applications won't know to set the flags on the files they install; it's really no better than having explicit lists of paths to include and exclude if I have to do it by hand anyway. Better is to do what MacOS X has done and reorganise the filesystem: /System, /Library and /Applications are all reconstructable from install media, while /Users needs backing up - but things go downhill as you look closer into /Users; ~/Library/Caches and ~/Library/Mail Downloads (which is a cache of files from my IMAP server) live alongside important things that don't exist anywhere else, like my application settings.

And changing the filesystem structure really DOES require changing all your applications.

I suppose the best we can do is a combination of static path definitions, as I'm doing, and training backup systems to recognise version control system checkouts and to be configurable to treat them differently (perhaps just backing up the current svn diff and the revision number the checkout came from, rather than backing up the entire checkout...).
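
For the svn case, something along these lines would do - a sketch with hypothetical paths; it records enough to recreate the checkout (URL and revision) plus the bits that exist nowhere else (local diffs and unversioned files):

    #!/bin/sh
    # Back up a working copy as metadata plus local changes, not wholesale.
    CO=$1                                      # path to the checkout
    OUT=/backup/checkouts/$(basename "$CO")
    mkdir -p "$OUT"
    ( cd "$CO" && svn info )   > "$OUT/info.txt"     # repository URL and revision
    ( cd "$CO" && svn diff )   > "$OUT/local.diff"   # uncommitted modifications
    ( cd "$CO" && svn status ) > "$OUT/status.txt"   # notes any unversioned files
    # Restore: svn checkout -r <revision> <URL>, then apply local.diff with patch(1).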

3 Comments

  • By @ndy, Fri 11th Jul 2008 @ 4:38 pm

    I think I disagree with you here.

    You should always have backups of everything that you can't afford to lose and you certainly can't afford to lose your OS and installed apps (group 1) even if you can restore them from the original media.

    If you adopt the above then your backup strategy dictates your application vendor strategy. i.e. you have to use pkgsrc for everything and you can't make any local modifications, otherwise you end up in the position of having to back up the third party apps or modified sources anyway, and then you may as well be backing up everything.

    Group 2: /tmp is wiped on every boot anyway. /var: This contains lots of important stuff like mail spools and other transient but critical things. It's also the hardest to back up as you need to ensure that, for example, you don't catch your mail server or db server in the middle of a write. So you need some kind of freezer and that can affect your HA requirements.
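
    (A minimal illustration of the "freezer" idea, assuming PostgreSQL and a made-up database name: dump to a flat file first, and let the filesystem backup pick that up rather than the live database files.)

        pg_dump mydb > /var/backups/mydb.sql               # consistent logical dump
        # mysqldump --single-transaction mydb > ...        # the MySQL/InnoDB equivalent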

    As everyone knows, most backup strategies fail at the restore stage because no one ever tests them properly.

    If you elect to reinstall your OS and apps via a download from, say, the NetBSD project server or from a vendor CD then you have no guarantee that when you come to do the restore you'll have access to it. You also have no way of ensuring that your post-restore environment is exactly the same as your pre-disaster environment, and that is very important.... and running the install programs for every little thing always takes longer than "restore /dev/sd1" and going for a coffee.

    If you suffer a failure in the middle of something important then you want to be able to get back to exactly where you were as quickly as possible. You don't want to have to worry about whether you're going to get unexpected package upgrades when you reinstall from the pkgsrc config file and you certainly don't want to discover that a hastily tweaked vendor config file is now back in its default state.

    So, when you install a machine, and every so often thereafter (maybe every year), you should do a complete dump of the OS and the apps. Every week or month you should do an incremental dump: these things don't change very often... hell... I just talked myself into doing this dump every day: most days it'll be zero.
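
    (With dump(8) that's roughly the following - a sketch, with the tape device and filesystem chosen arbitrarily; -u records each run in /etc/dumpdates so later levels know what has changed.)

        dump -0 -u -a -f /dev/nrst0 /usr       # full (level 0) dump to tape
        dump -1 -u -a -f /dev/nrst0 /usr       # later: everything changed since the level 0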

    Group 3: So what if it came from the server? There are always changes in there that you don't really want to commit yet but would really rather not lose. Don't rely on your ability to remember to save each one of them in a specially named patch file: one day you will forget.

    Group 4: This is going to be data in /var or data in /home. It'll get covered by default by a good backup strategy so doesn't need a section of its own.

    Group 5: Very similar data life model to group 1. Therefore the same strategy is required.

    I'm all in favour of a layered strategy and possibly having slightly different strategies for group 1/5 data than for /home and /var data. However, if you design your layers properly I think you can just apply the same thing everywhere and group 1/5 data will just produce lots of zero length backups which are essentially free.

    My backup strategy starts when I first install a machine. Lots of people argue that everything should be in a single large partition. I'm more aligned to the old school philosophy of different partitions and file systems for different types of data. wrt backups it gives you lots of flexibility when it comes to restore and quota management.

    / and /usr: Make this a partition. Make it read-only if you like. A read-only one means you can back it up less frequently than a read-write one.

    /var: Make this a partition.

    /home: Make this a network mount or a partition.

    /usr/local: Make this a partition as well: it stops you reaching for your backups every time a badly behaved system installer stamps all over /usr.

    /export: I tend to make this a partition as well since that's where I store all my really big non user specific data. I use it as a place to send backups from other machines and as a place for OS install media and general network shares and stuff. You might call this /data or something else.

    Layers are important and not every layer of backup needs to exist for each partition.

    There are offsite tape backups that need to exist for everything but you can let them get quite old in some instances. These guard against complete system failures.

    There are onsite tape backups that could be the incremental versions of your offsite set. These guard against complete system failures and week-to-week data loss such as viruses.

    There are onsite online backups that are made frequently. These guard against momentary brain absences such as "rm myfile" instead of "rm myfile~".

    So, for me, the layers work like this:

    I (should) have a set of tape backups. (I've designed how they will work but I've yet to actually make any.) I make a "level 0" dump of all my filesystems every 6 months. That takes as much tape as I have data. I think I can get it onto no more than a couple of DDS3s per machine. I then do a weekly "Towers of Hanoi" incremental backup. The first 9 weeks go like this:

    0 3 2 5 4 7 6 9 8

    Then I do a level 1 backup:

    1 3 2 5 4 7 6 9 8

    and again:

    1 3 2 5 4 7 6 9

    By that time 26 weeks have passed so I take another level 0 backup, but you can keep doing it for as long as you want. When the level 1 backup gets bigger than 1 tape it starts to get messy. So, given that I use DDS3 I can generate between 12 and 24GB of new data every 6 months and not use more than 1 tape for anything other than the level 0 backups.
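
    (If it helps, the rota above boils down to a 26-entry lookup table - a throwaway sketch:)

        #!/bin/sh
        # Print the dump level for week N (1-26) of the cycle described above.
        WEEK=$1
        LEVELS="0 3 2 5 4 7 6 9 8 1 3 2 5 4 7 6 9 8 1 3 2 5 4 7 6 9"
        LEVEL=$(echo $LEVELS | awk -v w="$WEEK" '{ print $w }')
        echo "week $WEEK: dump at level $LEVEL"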

    For file systems that don't change much, such as /, the incremental dump runs fast and doesn't take up much space on the tape, and I still capture all the little details that trip you up on a restore.

    I've got a tape schedule that tries to spread the wear evenly over all the tapes as much as possible. It's quite complex and I'm going to need a script that tells me which tape to get from where each week.

    For my online backups I have a separate disk to which I rsync my partitions. I use a rather modified version of this script: http://www.mikerubel.org/computers/rsync_snapshots/

    This uses hardlinks to preserve space for files that haven't changed and therefore you need to ensure you have lots of inodes. You get fine grained backups of rarely changing partitions for almost the same cost as frequently changing ones. i.e. the cost in terms of time and disk space scales per MB of change (plus a constant) rather than per frequency of backup.

    My mods amount to storing YYYYMMDDHHMM style folders rather than rotating ones, and keeping the snapshots until I delete them: I've written an algorithm that will use some kind of exponentially decaying reaper to decide what to delete, but I haven't integrated it yet.
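
    (The modified approach amounts to something like this sketch - GNU cp's -l option does the hard-linking, and the paths are placeholders:)

        #!/bin/sh
        # Make a YYYYMMDDHHMM snapshot that hard-links against the previous one.
        SNAPDIR=/backup/snapshots
        NOW=$(date +%Y%m%d%H%M)
        mkdir -p "$SNAPDIR"
        PREV=$(ls "$SNAPDIR" | sort | tail -1)          # most recent snapshot, if any
        if [ -n "$PREV" ]; then
            cp -al "$SNAPDIR/$PREV" "$SNAPDIR/$NOW"     # hard-link every existing file
        else
            mkdir "$SNAPDIR/$NOW"
        fi
        rsync -a --delete /home/ "$SNAPDIR/$NOW/"       # rewrites only what has changed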

    I run these backups when I feel like it, but I should really aim to do it at least once a day, and maybe hourly for /home.

    If I had important work for an important client then I'd probably run it at least hourly on some partitions.

    I'm wondering if encrypted backups are a good idea. Obviously I'd have to back up the keys somehow. I could encrypt the content of the tapes and I'd be able to send them out to any one of my "friends" who'd take them. I could then send USB keys containing the keys to, say, my parents' house.
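
    (A sketch of that idea - dump to standard output, encrypt symmetrically, write to tape; the key file is what would live on the offsite USB keys. The GnuPG options are from memory, so double-check them - newer gpg versions may also need --pinentry-mode loopback for batch symmetric encryption.)

        dump -0 -a -f - /home \
            | gpg --batch --symmetric --passphrase-file /root/backup-key \
            | dd of=/dev/nrst0 bs=64k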

    So in summary:

    + You want to design a strategy that doesn't require any maintenance once it's set up: you just feed it blank media and rest assured that it'll get all of the important stuff.

    + It needs to be essentially free (in MB and time) to run backups on stuff that hasn't changed: if it isn't, then you'll try to economise by grouping things that are backed up, and then stuff will invariably fall through the cracks. This is kind of like a special case of the first point.

    + You need to have a single action restore: just unpack and go. If you have to configure anything or remember anything when you're stressed then it will make life very unpleasant and error prone. If someone else can't do it for you then it's not simple enough.

    I initially drew my inspiration from this site: http://www.taobackup.com/

    It's an advert for a commercial product, but it does contain some good advice.

  • By alaric, Fri 11th Jul 2008 @ 9:15 pm

    > I think I disagree with you here.

    Oooh!

    > You should always have backups of everything that you can't afford to lose and you certainly can't afford to lose your OS and installed apps (group 1) even if you can restore them from the original media.

    > If you adopt the above then your backup strategy dictates your application vendor strategy. i.e. you have to use pkgsrc for everything and you can't make any local modifications, otherwise you end up in the position of having to back up the third party apps or modified sources anyway, and then you may as well be backing up everything.

    Ah, not necessarily. pkgsrc installs stuff into /usr/pkg - anything else should go into /usr/local, so can be backed up.

    > Group 2: /tmp is wiped on every boot anyway.

    Yep, all the more reason to exclude it from backups 😉

    > /var: This contains lots of important stuff like mail spools and other transient but critical things. It's also the hardest to back up as you need to ensure that, for example, you don't catch your mail server or db server in the middle of a write. So you need some kind of freezer and that can affect your HA requirements.

    Yeah... I try and avoid databases that can't recover from a bad shutdown, though. When I have the choice.

    > If you elect to reinstall your OS and apps via a download from, say, the NetBSD project server or from a vendor CD then you have no guarantee that when you come to do the restore you'll have access to it. You also have no way of ensuring that your post-restore environment is exactly the same as your pre-disaster environment, and that is very important.... and running the install programs for every little thing always takes longer than "restore /dev/sd1" and going for a coffee.

    But all the software on a UNIX server gets recreated from scratch fairly regularly anyway. How many weeks go by before some fundamental component grows a security vulnerability and you need to reinstall it and the things that depend upon it? Having the system come back neatly from a rebuild is as important as having it come back after a reboot - failing to do so is a sign of sloppiness!

    Heck, I did a reinstall of most of pkgsrc on infatuation at the start of last week, and since then, vulnerabilities have been found in its pcre and ruby packages... not to mention that mutt's been vulnerable to a 'signature spoofing' attack (whatever that is) for ages now, but there's no fixed version yet available.

    > If you suffer a failure in the middle of something important then you want to be able to get back to exactly where you were as quickly as possible. You don't want to have to worry about whether you're going to get unexpected package upgrades when you reinstall from the pkgsrc config file and you certainly don't want to discover that a hastily tweaked vendor config file is now back in its default state.

    > So, when you install a machine, and every so often thereafter (maybe every year), you should do a complete dump of the OS and the apps.

    But such a snapshot will be dangerously out of date after a couple of months, and not safe to restore onto a network-connected system 🙁

    > I'm all in favour of a layered strategy and possibly having slightly different strategies for group 1/5 data than for /home and /var data. However, if you design your layers properly I think you can just apply the same thing everywhere and group 1/5 data will just produce lots of zero length backups which are essentially free.

    My main argument, though, is that it's a pain to separate things into the groups, due to the way they're spread all over the filesystem.

    In an Ideal World, I'd love a system with top-level directories for Installed Stuff (OS and apps), Temporary Stuff, and Interesting Stuff 😉 Yeah, having VCS checkouts handled specially is a small win; it may be easier and safer to just back them up with everything else.

    > / and /usr: Make this a partition. Make it read-only if you like. A read-only one means you can back it up less frequently than a read-write one.

    I have a new blog post in the pipeline about configuring systems to boot from a CD-ROM, actually, containing a 'live filesystem' for / and /usr, over which /etc is union-mounted from the hard disk (so the on-disk /etc only contains changed files)... but the reasons and tradeoffs involved will have to wait for that posting to be gone over in detail 😉

    > /export: I tend to make this a partition as well since that's where I store all my really big non user specific data. I use it as a place to send backups from other machines and as a place for OS install media and general network shares and stuff. You might call this /data or something else.

    nod

    > [Andy's backup rota]

    I have a vaguely similar approach. Since I have two machines side by side in the datacentre joined by a cable, backups between them are cheap, so my most important data - the svn repositories, trac databases, and SQL databases - are dumped (not at the filesystem level, but at a logical level respecting locks and stuff) from the fileserver (infatuation) to the other machine (fear) on a nightly basis, with generations of dumps.
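
    (The dump side is nothing clever - roughly this, with made-up names and paths, run nightly from cron; the real thing then ships the results over to the other machine:)

        #!/bin/sh
        # Logical dumps: consistent copies taken by the tools that own the data.
        NOW=$(date +%Y%m%d)
        svnadmin dump /home/svn/myrepo | gzip > /backup/dumps/myrepo-$NOW.svndump.gz
        pg_dump mydb | gzip > /backup/dumps/mydb-$NOW.sql.gz
        find /backup/dumps -name '*.gz' -mtime +30 -exec rm {} +   # keep about a month of generations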

    Then the interesting filesystems of both machines (excluding a whole bunch of cache/temp directories, pkgsrc, the actual SQL database files, and so on, but definitely including the nightly dump files on fear) are rsynced down to a 500GB disk at home. I rsync infatuation (it being the fileserver) most weeks, and occasionally do fear or pain if either has been 'worked on' recently and so will have interesting changes. I like rsync, since it gives me a real filesystem I can go and look into for interesting things without needing to mess with restore apps.

    Pah, looking over rsync's logs, it looks like I need to add a new directory to the ignore list: /var/tmp. /var really is a worst-case for backing up 😉

    But I want to do more than just have a single rsync snapshot - I'd really like to set up something like venti on a separate disk array. Why venti and not rsync-with-hardlink-snapshots? Well, because a venti-like system can just grow linearly by adding extra disks without any nasty filesystem reshuffling! I'm not so keen on tapes since the size of them is so limited, the hardware to read them is expensive, etc - pools of USB hard disks stacked up on an isolated pair of mini-ITX fanless backup servers would be better, I think!

    > I'm wondering if encrypted backups are a good idea. Obviously I'd have to back up the keys somehow. I could encrypt the content of the tapes and I'd be able to send them out to any one of my "friends" who'd take them. I could then send USB keys containing the keys to, say, my parents' house.

    Yeah, I've wondered about that. I'm feeling compelled to err on the side of safety for now, though.

    > So in summary:
    >
    > + You want to design a strategy that doesn't require any maintenance once it's set up: you just feed it blank media and rest assured that it'll get all of the important stuff.
    >
    > + It needs to be essentially free (in MB and time) to run backups on stuff that hasn't changed: if it isn't, then you'll try to economise by grouping things that are backed up, and then stuff will invariably fall through the cracks. This is kind of like a special case of the first point.

    The problem is that the vast bulk of the changes to a system are just software being reinstalled due to security loopholes... /usr/pkg is about a gigabyte on most of my machines, but with a high turnover.

    Take infatuation as an example. We have 760MiB in /usr/pkg. /usr/pkgsrc, which is a combination of stuff out of NetBSD's cvs server and the downloaded source tarballs, is 927MiB - 622MiB of that being the source tarballs. And the pkgsrc work directory, where the tarballs are extracted and the build process occurs (but is kept around in case I need to reinstall the packages later, generally), is currently 477MiB. So that's (760+927+477)MiB = ~2GiB that can all be regenerated by downloading pkgsrc and dropping in the configuration files and telling pkg_chk to do its thing... while saving me 2GiB in data transfer 😉

    > + You need to have a single action restore: just unpack and go. If you have to configure anything or remember anything when you're stressed then it will make life very unpleasant and error prone. If someone else can't do it for you then it's not simple enough.

    Ok, but I'm used to rebuilding my systems from base configurations. Certainly, I need to script it more, though 😉 But wait until you see my boot-from-CD plan, which will mean all my servers share a SINGLE boot CD-ROM with the core OS on, and the hard disk partitioned into lots of Xen images...

  • By David Cantrell, Tue 15th Jul 2008 @ 10:50 am

    I just back up everything. It's easier than figuring out what should be backed up. I use rsnapshot to do it, which Does The Right Thing for data that rarely changes.
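
    (For anyone who hasn't met it, rsnapshot is just a small config file plus cron entries - a minimal sketch; rsnapshot insists on tabs between fields, and the retention names, counts and paths here are arbitrary:)

        # /etc/rsnapshot.conf (excerpt)
        snapshot_root   /backup/snapshots/
        interval        daily   7
        interval        weekly  4
        backup          /home/  localhost/
        backup          /etc/   localhost/

        # then, from cron:
        rsnapshot daily
        rsnapshot weekly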

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales