A stressful upgrade – and a design for a better backup system

I'd decided I wanted to back up to S3 or something similar, since my previous backup system worked by rsync-ing files down my home ADSL to a large USB disk; it was rather limited by my home bandwidth, and it meant managing my own disk, which would grow old and die some day, be awkward when it got full, and so on. I had a plan to knock together an archival system based on Venti-style content-addressable storage, but for expedience I'd just gone with duplicity, since it was already written. It uses the rsync algorithm to work out deltas to upload, producing incremental backups relative to an initial full backup.

However, in order to do a restore, it then has to examine the full backup and every pertinent delta since. Which means that you periodically need to do fresh full backups, to give it more recent checkpoints to work forwards from - and so that you can delete older deltas, which are then no longer required to recover the most recent state.

Which isn't very efficient. A content-addressable snapshot system lets you just upload deltas, yet can still always retrieve the most recent state without needing to follow a delta chain, while not needing intelligence on the server side. Which is just what we want. With reference counts, you can support the removal of old snapshots, too.

So I think I am going to have to write that archival system after all.

Here's how I want to make it work. It'll have pluggable back-ends, so it can work on local filesystems, S3, sftp servers, WebDAV, or any other store capable of getting and setting blocks of data based on a key. The back-end will have the following operations (there's a rough sketch of the interface after the list):

  • getmaxsize() - returns the largest size of block we can store
  • void put(key,data,type) - uploads the data block under the key and sets its refcount to 1, unless there's already a data block of that type under that key, in which case nothing happens. The key and type are strings; the data is a block of bytes up to maxsize bytes long.
  • bool exists?(key,type) - returns true if the referenced block of the given type exists
  • data get(key,type) - returns the contents of the referenced block of the given type
  • link(key,type) - increments the refcount of the referenced block
  • data unlink(key,type) - decrements the refcount of the referenced block; if the refcount reaches zero, deletes the block and returns its content (otherwise, it returns some sentinel nil value).
  • tag(name,key,type) - updates the named 'tag' to point to the given block. The name can be any string. This operation creates tags if they do not already exist, otherwise it updates them to point to the new content.
  • key,type get_tag(name) - returns the value of a tag
  • name[] listtags() - lists all tags
  • rmtag(name) - deletes a tag
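
To make that concrete, here's a rough sketch of the interface in Python, using a plain local directory as the store. It's purely illustrative: the one-file-per-block layout, the refcount sidecar files, and the default block size are assumptions of the sketch, not part of the design.

    import os

    class FileBackend:
        """Toy back-end: one file per block, refcounts in sidecar files."""

        def __init__(self, root, maxsize=4 * 1024 * 1024):
            self.root = root
            self.maxsize = maxsize
            os.makedirs(os.path.join(root, "blocks"), exist_ok=True)
            os.makedirs(os.path.join(root, "tags"), exist_ok=True)

        def _path(self, key, type):
            return os.path.join(self.root, "blocks", "%s.%s" % (key, type))

        def _read_refcount(self, path):
            with open(path + ".refcount") as f:
                return int(f.read())

        def _write_refcount(self, path, n):
            with open(path + ".refcount", "w") as f:
                f.write(str(n))

        def getmaxsize(self):
            return self.maxsize

        def put(self, key, data, type):
            path = self._path(key, type)
            if os.path.exists(path):
                return                       # duplicate put: nothing happens
            assert len(data) <= self.maxsize
            with open(path, "wb") as f:
                f.write(data)
            self._write_refcount(path, 1)    # fresh block starts at refcount 1

        def exists(self, key, type):
            return os.path.exists(self._path(key, type))

        def get(self, key, type):
            with open(self._path(key, type), "rb") as f:
                return f.read()

        def link(self, key, type):
            path = self._path(key, type)
            self._write_refcount(path, self._read_refcount(path) + 1)

        def unlink(self, key, type):
            path = self._path(key, type)
            n = self._read_refcount(path) - 1
            if n > 0:
                self._write_refcount(path, n)
                return None                  # sentinel: block still referenced
            data = self.get(key, type)       # last reference gone: delete, return contents
            os.remove(path)
            os.remove(path + ".refcount")
            return data

        def tag(self, name, key, type):
            with open(os.path.join(self.root, "tags", name), "w") as f:
                f.write("%s %s" % (key, type))

        def get_tag(self, name):
            with open(os.path.join(self.root, "tags", name)) as f:
                key, type = f.read().split()
            return key, type

        def listtags(self):
            return os.listdir(os.path.join(self.root, "tags"))

        def rmtag(self, name):
            os.remove(os.path.join(self.root, "tags", name))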

The front-end will use the back-end, wrapping it with compression and encryption of every block. That'll be symmetric key encryption, unlike duplicity's public key system; I'm not publishing my backups, so there's no need for the complexity of a public key system. I want to be able to restore a backup knowing just a base URL and a passphrase I can memorise, without needing to store a keypair somewhere and make sure I never lose it!
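
Something like this is what I have in mind for the wrapping - derive a key from the passphrase, then compress and encrypt each block on the way in and reverse it on the way out. The use of the Python cryptography library (Fernet plus PBKDF2), the fixed salt, and the iteration count are just assumptions for the sketch:

    import zlib, base64
    from cryptography.fernet import Fernet
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

    def key_from_passphrase(passphrase, salt=b"archive-salt"):
        # Turn a memorable passphrase into a symmetric key for Fernet.
        kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32,
                         salt=salt, iterations=200_000)
        return Fernet(base64.urlsafe_b64encode(kdf.derive(passphrase.encode())))

    def encode_block(raw, fernet):
        # Compress, then encrypt, before anything goes near the back-end.
        return fernet.encrypt(zlib.compress(raw))

    def decode_block(stored, fernet):
        # Reverse of encode_block, used on restore and verify.
        return zlib.decompress(fernet.decrypt(stored))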

Then it'll do the usual Venti-esque trick: small files stored as direct blocks (of type file); large files as sets of subfile blocks, with a tree of indirect-file blocks pointed to from an indirect-file superblock; and directories stored as lists of file names, metadata, and pointers to the file or indirect-file blocks, encoded in a similar fashion. The block keys will, of course, be the hashes of the compressed and encrypted blocks - not SHA-1, but something more trusted these days.
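
Roughly, continuing the sketch above (yaml here is PyYAML). One caveat: the encryption in the sketch is randomised, so hashing the encrypted block would give a different key every time and defeat deduplication; the sketch therefore keys blocks on their pre-encryption content, which differs slightly from what I said above - hashing the stored block would need deterministic encryption. The block type names, block size, and single level of indirection are illustrative too:

    import hashlib, yaml

    def block_key(raw):
        """Key a block by the SHA-256 of its pre-encryption bytes."""
        return hashlib.sha256(raw).hexdigest()

    def store_raw(store, fernet, raw, type):
        """Encode and upload a block; put is a no-op if it's already there."""
        key = block_key(raw)
        store.put(key, encode_block(raw, fernet), type)
        return key

    def store_file_data(path, store, fernet, blocksize=1024 * 1024):
        """Store a file's contents; return a reference to a data or indirect block."""
        keys = []
        with open(path, "rb") as f:
            while True:
                chunk = f.read(blocksize)
                if not chunk:
                    break
                keys.append(store_raw(store, fernet, chunk, "data"))
        if len(keys) == 1:
            return {"key": keys[0], "type": "data"}   # small file: one direct block
        blocks = [{"key": k, "type": "data"} for k in keys]
        indirect = yaml.safe_dump({"blocks": blocks}).encode()
        return {"key": store_raw(store, fernet, indirect, "indirect"),
                "type": "indirect"}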

Ideally, the format of every block apart from raw data blocks should be some introspectable format, such as YAML, so that it's easy to enumerate every reference to another block within them without explicit knowledge of all the types and their encodings. That simplifies the recursive unlink strategy I describe later, and makes it easy to reliably implement integrity checkers that follow the tree, making sure the reference counts are correct.
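
For example, a directory block might decode to something like the YAML below, and a completely generic walker can then pull out every outgoing reference without knowing what kind of block it's looking at. The field names are made up for the sake of the example:

    import yaml

    example_directory_block = """
    type: directory
    entries:
      - name: notes.txt
        mode: "0644"
        mtime: 1229300000
        ref: {key: 3f2a9c..., type: file}
      - name: photos
        mode: "0755"
        mtime: 1229300000
        ref: {key: 9b01d4..., type: directory}
    """

    def references(node):
        """Yield every {key, type} reference in a decoded block, whatever its type."""
        if isinstance(node, dict):
            if set(node) == {"key", "type"}:
                yield node["key"], node["type"]
            else:
                for value in node.values():
                    yield from references(value)
        elif isinstance(node, list):
            for item in node:
                yield from references(item)

    for ref_key, ref_type in references(yaml.safe_load(example_directory_block)):
        print(ref_key, ref_type)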

We can store a local cache listing the keys and types of blocks we've uploaded, so that the put operation can quickly and efficiently return when asked to upload a duplicate block, without needing to talk to the remote server. If there's a miss in the cache, it still needs to check the back-end store for duplication rather than blindly uploading the block, and then update the cache so it doesn't make the same mistake again. That way we can delete or lose the cache and things will still be correct, just slower - and they'll slowly speed up again as the cache relearns.
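
One possible shape for that cache, assuming SQLite as the local store; only put is shown, and the table layout is made up:

    import sqlite3

    class CachingStore:
        def __init__(self, backend, cache_path):
            self.backend = backend
            self.db = sqlite3.connect(cache_path)
            self.db.execute("CREATE TABLE IF NOT EXISTS seen "
                            "(key TEXT, type TEXT, PRIMARY KEY (key, type))")

        def put(self, key, data, type):
            if self.db.execute("SELECT 1 FROM seen WHERE key=? AND type=?",
                               (key, type)).fetchone():
                return                              # cache hit: definitely already uploaded
            if not self.backend.exists(key, type):  # cache miss: ask the store itself
                self.backend.put(key, data, type)
            self.db.execute("INSERT OR IGNORE INTO seen VALUES (?, ?)", (key, type))
            self.db.commit()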

We could also cache the tags, and the snapshot blocks, since they are all small.

As a major optimisation, we could handle uploading a file by first computing the entire tree in memory, in order to find the hash of the actual file block that forms the root - the block containing the file metadata and the link to the data block, or to the indirect tree of blocks pointing to the data blocks - and then just check whether that file block already exists in the system, and if so, reuse it.
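
The reuse check itself is tiny once the planning pass exists. Here 'planned' is assumed to be the list of (key, stored bytes, type) blocks the file would become - produced by something like store_file_data above, but with the uploads deferred - with the root file block last:

    def upload_planned(planned, store):
        """planned: [(key, stored_bytes, type), ...] with the root file block last."""
        root_key, _, root_type = planned[-1]
        if store.exists(root_key, root_type):
            return root_key, root_type, True      # unchanged file: upload nothing, reuse
        for key, stored, type in planned:
            store.put(key, stored, type)
        return root_key, root_type, False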

Of course, we need to do this for every file in a directory (including, recursively, directories) in order to see if the directory itself already exists - and we only increment the reference counts on the unchanged files if a new directory object is being created; if we reuse the directory object, we are not creating any new references to the files, so their refcounts remain unchanged. We could also optimise this process by storing a local cache of file mtimes and the hashes of their file blocks, so that a file that has not changed can be spotted purely locally, and reused.

In effect, the snapshot algorithm, in order to get the reference counts right, will need to do a top-down traversal of the filesystem to find the modified files, then recurse back up, updating modified directories and incrementing the refcounts of any unmodified objects within those modified directories.
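
Putting that together, the recursion might look roughly like this; snapshot_file is assumed to wrap the file-level planning and reuse check above and to report whether it reused an existing file block, and symlinks, special files, the mtime cache, and error handling are all glossed over:

    import os, yaml

    def snapshot_dir(path, store, fernet):
        """Return (key, type, reused) for the directory block describing 'path'."""
        entries, children = [], []
        for name in sorted(os.listdir(path)):
            full = os.path.join(path, name)
            if os.path.isdir(full):
                key, type, reused = snapshot_dir(full, store, fernet)
            else:
                key, type, reused = snapshot_file(full, store, fernet)  # hypothetical helper
            st = os.stat(full)
            entries.append({"name": name, "mode": st.st_mode, "mtime": int(st.st_mtime),
                            "ref": {"key": key, "type": type}})
            children.append((key, type, reused))
        raw = yaml.safe_dump({"entries": entries}).encode()
        dir_key = block_key(raw)
        if store.exists(dir_key, "directory"):
            return dir_key, "directory", True     # unchanged directory: reuse, touch nothing
        store.put(dir_key, encode_block(raw, fernet), "directory")
        for child_key, child_type, child_reused in children:
            if child_reused:                      # the new directory block now references
                store.link(child_key, child_type) # an existing child, so bump its refcount
        return dir_key, "directory", False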

Then we can provide the following operations:

  • Archive a given filesystem tree to a given data store - then store the pointer to the root of the resulting directory in a block of type snapshot with the snapshot metadata (source hostname and filesystem path, date and time, optional comments); then update a specified tag to reference the new snapshot - not forgetting to include the old tag reference in the snapshot metadata when writing it, to form a history chain.
  • Given a tag name, look it up to get the snapshot reference, or use a supplied raw snapshot reference; list its metadata, list its files, or extract any given files matching a regexp.
  • Given a tag name or snapshot reference and a search query (date range, comment regexp, etc), follow the chain of snapshots to find the one that matches the criteria
  • Given a tag name, look it up to get the snapshot reference, or use a supplied raw snapshot reference; and verify it: recursively fetch every block, checking that what we get back matches the hash and decrypts correctly with the supplied passphrase to a valid block.
  • Delete a snapshot and its ancestors, by unlinking it; if the refcount then becomes zero, the contents of the snapshot are returned, so decode the returned block and recursively delete every block it references (there's a sketch of this after the list). The reference counts will ensure that any subtrees shared by other snapshots are left intact.
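
Here's a sketch of that recursive delete, reusing the references() walker and decode_block from the earlier sketches; block types that hold no references (the raw data blocks) just stop the recursion:

    import yaml

    def delete_block(store, fernet, key, type):
        stored = store.unlink(key, type)
        if stored is None:
            return                    # refcount still positive: shared with another snapshot
        if type == "data":
            return                    # raw data blocks contain no references
        decoded = yaml.safe_load(decode_block(stored, fernet))
        for child_key, child_type in references(decoded):
            delete_block(store, fernet, child_key, child_type)

Deleting a whole snapshot by tag would then be delete_block(store, fernet, *store.get_tag(name)), followed by removing or re-pointing the tag.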

The tag, if not explicitly specified, would default to the hostname plus the root path of the filesystem being archived.

Also, another feature I'd like, continuing from my thoughts on backups and archives, is for the system to process .ignore files of some kind (like .cvsignore, .gitignore, etc) - that is, files that can be placed in directories, specifying regexps against which files in that directory or beneath are checked; any that match are not included in the snapshots.

This is so that I can have a top-level .ignore file that masks off /dev and boring bits of /var and so on, while users can then mask out things like my directory of OS install ISOs: I can keep the .torrent files therein while not bothering to archive the many gigabytes of .iso files, in a fine-grained manner with localised control. My duplicity and other rsync-based backup systems have a horrendously long list of --exclude=... flags on the command line to do this by hand from a single central location. Yuck.
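
Here's a sketch of how the .ignore handling might slot into the directory walk above. The file name .ignore, one regexp per line, and matching against the path relative to the directory holding the .ignore file are all assumptions:

    import os, re

    def load_ignores(dirpath, inherited=()):
        """Return inherited (base_dir, regexp) pairs plus any from dirpath/.ignore."""
        patterns = list(inherited)
        ignore_file = os.path.join(dirpath, ".ignore")
        if os.path.exists(ignore_file):
            with open(ignore_file) as f:
                for line in f:
                    line = line.strip()
                    if line and not line.startswith("#"):
                        patterns.append((dirpath, re.compile(line)))
        return patterns

    def is_ignored(path, patterns):
        """True if the path, relative to the directory holding the rule, matches it."""
        return any(regex.search(os.path.relpath(path, base)) for base, regex in patterns)

snapshot_dir would call load_ignores at each level, pass the result down the recursion, and skip any entry for which is_ignored returns true.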

I predict I can implement that in two days - maybe I'll get a chance over Christmas?

UPDATE 2008-12-15: Automatically incrementing the reference count on a duplicate put was naive; I've also now described the unchanged-file optimisations.
UPDATE 2008-12-18: Added local .ignore files.


3 Comments

  • By Improbulus, Mon 15th Dec 2008 @ 4:44 pm

    Whew! I'd have thought you'd have wanted to have a break at Xmas, but I guess building a backup system is as good as a break to you, innit? 😛 🙂 Good luck - but do make sure you have some kind of a rest, at least! Unlike those of us who'll have to swot..

  • By Faré, Mon 22nd Dec 2008 @ 12:31 am

    Congratulations for salvaging the system.

    Speaking of backups, are you satisfied with duplicity? What else did you try? What made you choose it over the competition? Would you recommend it, or try something else?

  • By Faré, Mon 22nd Dec 2008 @ 2:19 am

    Oops, hadn't read page 2.

    Yes, I'd like such a system. But why limit it to backups? If you standardize on a small-enough bucket size (say, 64KB to a few MB) and have a general content-addressed mechanism (say, using Tiger Tree-Hash - SHA-1 is broken!), then your protocol is also perfect as a storage backend for file-sharing (Gnutella also uses TTH).

    Do you have working code already?

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales