Venti: Append-Only Storage Management

Check out this PDF:

http://www.cs.bell-labs.com/sys/doc/venti/venti.pdf

It's a description of a new mass storage system designed for Plan 9. Venti offers permanent archival storage, with basically two operations:

  1. Given a block of up to 52KB of data (plus some metadata, which is fairly irrelevant), the system stores it and returns the SHA1 hash of the data.
  2. Given an SHA1 hash, the system returns the previously stored block of data with that hash.

This is a very simple interface, and it can be implemented quite efficiently. The key thing is that if you save the same block twice, the system only stores one copy - since it uses SHA1 hashes as lookup keys, an attempt to write a new block with an SHA1 the system already holds is handled by just returning that same SHA1. The odds of two different blocks having the same SHA1 are much smaller than the odds of all sorts of more worrying failures.
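
In fact, the whole interface is small enough to sketch in a few lines. Here's a minimal in-memory model of it (the class name and the dict-backed storage are my own illustration, not the real server's API):

    import hashlib

    class VentiStore:
        MAX_BLOCK = 52 * 1024  # the block-size limit described above

        def __init__(self):
            self.blocks = {}  # SHA1 hex score -> block contents

        def write(self, data):
            """Store a block, returning its SHA1 score. Duplicate writes
            are free: the score is the key, so identical bytes always
            map to a single stored copy."""
            if len(data) > self.MAX_BLOCK:
                raise ValueError("block too large")
            score = hashlib.sha1(data).hexdigest()
            self.blocks.setdefault(score, data)
            return score

        def read(self, score):
            """Return the block previously stored under this score."""
            return self.blocks[score]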

Yet it can be used to do very useful things. For a start, things larger than 52KB can be stored by breaking them into chunks and storing the chunks, then making a (much smaller) file of just the SHA1 hashes, and storing that file. If that file is in turn larger than 52KB, it can also be broken into chunks, until you end up with less than 52KB of hashes, which you can store in a single block. Some of the stored metadata can be used to decide whether a block is a list of hashes of other blocks or actual data; or you can prefix each block with a "magic number".
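
Building on the sketch above, the tree construction is just a loop: store the chunks, then store blocks of scores until a single root score remains (the fan-out arithmetic here is mine; hex scores are 40 bytes each):

    def store_tree(store, data):
        """Store data of any size; return the root score and tree depth."""
        chunk = VentiStore.MAX_BLOCK
        scores = [store.write(data[i:i + chunk])
                  for i in range(0, len(data), chunk)] or [store.write(b"")]
        fanout = chunk // 40  # 40-byte hex scores per pointer block
        depth = 0
        while len(scores) > 1:
            scores = [store.write("".join(scores[i:i + fanout]).encode())
                      for i in range(0, len(scores), fanout)]
            depth += 1
        return scores[0], depth

A reader then uses the depth (or the per-block metadata, as mentioned) to tell pointer blocks from data blocks while unwinding the tree.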

So they wrote a pair of tools called vac and unvac that are used like gzip and gunzip. vac takes a file, stores it on the Venti server (possibly as a tree of hashes, if it's large), and replaces it with a small file containing the SHA1 hash of the root node of the tree. unvac takes such a file and fetches the referenced file back from Venti. vac also handles directories, by storing the files and then storing a list of the files and their metadata.

The fact that the same data block will only be stored once is very neat, and nullifies the main immediate problem of an append-only archival service: that it'd keep growing. It turns out that even if you do "naive backups" and just vac the entire filesystem every night, new space will only be allocated for files that have actually changed. You can save some network I/O by comparing the files on disk against the hashes stored in the already-archived directories and not even bothering to send unchanged files back to the server, but that doesn't make the archive any more space-efficient - the duplicate blocks would be discarded anyway.
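
That comparison might look something like this (still using the sketch above; the membership test stands in for a hypothetical "do you have this score?" query to the server, and chunking of large files is ignored for brevity):

    import hashlib
    from pathlib import Path

    def backup_file(store, path):
        """Return the file's score, transmitting data only if needed."""
        data = Path(path).read_bytes()
        score = hashlib.sha1(data).hexdigest()
        if score in store.blocks:  # stand-in for a server-side query
            return score           # unchanged since the last dump: no I/O
        return store.write(data)   # new or modified: send it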

The super neat thing is that they then built a file system that sits in front of Venti. It's a normal read/write file system where overwritten data stays overwritten, but it supports snapshotting the state of the filesystem for later access - and these snapshots may optionally be made permanent, to a Venti server. In that case the data is all swept from the file system and replaced with references to the permanent copies, using a copy-on-write strategy: if somebody tries to modify an archived block, a copy is made in the file system and that copy is modified. Things that never really change, though, will be swept away to Venti and live there quite happily, not taking up space in the file system.
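
The copy-on-write step is easy to picture as a per-block state machine (all names here are illustrative; the real file system naturally works rather differently in detail):

    class CowBlock:
        """A block that lives in Venti until the first write touches it."""

        def __init__(self, store, score):
            self.store = store
            self.score = score  # the archived copy's score
            self.local = None   # a private copy, once one exists

        def read(self):
            if self.local is not None:
                return bytes(self.local)
            return self.store.read(self.score)

        def write(self, offset, data):
            # First modification: copy the archived block back into the
            # file system. Venti's copy is never touched.
            if self.local is None:
                self.local = bytearray(self.store.read(self.score))
            self.local[offset:offset + len(data)] = data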

The fact that the file system is therefore swept of static data will, I think, make certain optimisations a lot easier... I need to think about how something like that might impact my TUNGSTEN file system design, which had the downside that static data would keep being reshuffled to compact free space. Hmm...

Now, I currently handle backups by doing dumps from my servers, over ssh, and storing them on a removable USB hard disk that I normally keep locked away. I do level 0 dumps a few times a year, then stack up level 1 dumps until the disk gets a bit tight, then do another level 0 before removing the old level 0 and level 1s. It'd be much nicer to instead have a permanent Venti store on my removable disk; with a log recording the root-block hash of the tree saved at each dump, I'd be able to easily pick out the state of the system at any past time without messing with level 0 and level 1 dumps. In effect, everything would be an incremental dump, saving me lots of bandwidth...
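
The dump log would hardly need to be more than this (the format and names are my own sketch, and it assumes the log is written in chronological order):

    import time

    def log_dump(logfile, root_score):
        """Append one line per dump: a timestamp and the root score."""
        with open(logfile, "a") as f:
            f.write("%s %s\n" % (time.strftime("%Y-%m-%d %H:%M"), root_score))

    def score_as_of(logfile, date):
        """Root score of the last dump taken on or before a given date."""
        best = None
        with open(logfile) as f:
            for line in f:
                stamp, score = line.rsplit(" ", 1)
                if stamp[:len(date)] <= date:
                    best = score.strip()
        return best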

I must talk to Ben about ; it could probably do something similar, although my access pattern isn't what it's designed for - it has a constantly available backup server with the mass storage that clients connect to when they need to dump, while I want to keep my backup disk offline whenever possible, and have the machine with the backup disk contact my servers to fetch a dump...

1 Comment

  • By andyjpb, Sun 15th Jan 2006 @ 12:15 pm

    git (the new content manager / SCM by Linus Torvalds) uses a similar idea for its backend storage. Each file committed is stored (compressed, IIRC) in a file named with its SHA1 ID. There are then files (objects) that contain lists of SHA1s to model directories and trees. There are then commit objects that associate a tree with its commit message, etc. Once an object has been committed to the repository it stays there forever, and any other files with the same contents end up referencing the same object. It also means that parts of the repository can be copied onto read-only media every now and then.
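
    That storage scheme is simple enough to sketch in a few lines; the header-then-hash-then-compress layout below matches git's loose-object format, though the function itself is just an illustration:

        import hashlib, os, zlib

        def store_blob(repo, content):
            """Store bytes the way git stores a blob; return its SHA1 ID."""
            obj = b"blob %d\x00" % len(content) + content
            sha = hashlib.sha1(obj).hexdigest()
            path = os.path.join(repo, "objects", sha[:2], sha[2:])
            if not os.path.exists(path):  # same contents, same object
                os.makedirs(os.path.dirname(path), exist_ok=True)
                with open(path, "wb") as f:
                    f.write(zlib.compress(obj))
            return sha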

    As for backups, I use a modified version of this: http://www.mikerubel.org/computers/rsync_snapshots/

    On the first backup, the contents of your tree are copied. On subsequent backups, the backup tree is hardlinked to itself. Then rsync is used in such a way that, for files that have changed, it unlinks the version in the new backup tree and then copies the new version across. It's similar to the traditional backup method of taking a level 0 backup and then doing incremental backups; however, your latest backup is always the "full backup" and the others are the increments, travelling back in time as opposed to forward. One of the downsides is in terms of redundancy: the whole thing only counts as one backup, as there is only one copy of each version of a file on the disk.
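
    The core of the scheme can be sketched like so (paths and names illustrative; the real scripts use rsync and cp -al for speed, but the mechanism is the same):

        import filecmp, os, shutil

        def snapshot(source, prev, new):
            """Copy source into new, hardlinking files unchanged since prev."""
            for dirpath, dirnames, filenames in os.walk(source):
                rel = os.path.relpath(dirpath, source)
                os.makedirs(os.path.join(new, rel), exist_ok=True)
                for name in filenames:
                    src = os.path.join(dirpath, name)
                    old = os.path.join(prev, rel, name)
                    dst = os.path.join(new, rel, name)
                    if os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                        os.link(old, dst)       # unchanged: share the inode
                    else:
                        shutil.copy2(src, dst)  # changed or new: real copy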

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales