Ugarit Roadmap

I've not had much time to hack on Ugarit lately, which is a shame - but just to keep you enthusiastic, here's my current roadmap.

In no particular order:

lzmalib has to go

I wanted to support a range of hashing and compression algorithms out of the box, and LZMA looked like a good high-compression algorithm for use with remote storage - where bandwidth and storage costs are likely to be significant, and the link slow enough to make the CPU cost of heavy compression worthwhile. So I looked in pkgsrc and found lzmalib, an implementation of LZMA, which I promptly wrapped up as a Chicken egg and used.

However, few systems seem to have lzmalib installed, as it's not been widely packaged. And I've since found that even in pkgsrc, it conflicts with the xz package, which provides another implementation of LZMA - and is more widely used.

So I ought to write a Chicken egg that wraps xz, and move over to using that more widely-available library.

mtime caching

Ugarit goes to great lengths to be efficient in its use of bandwidth to the storage system, and space thereon, by dividing the filesystem to be snapshotted into chunks, hashing them, and using the hashes to see if each chunk already exists in the archive (even if it's part of a previous snapshot, or a snapshot of an entirely different filesystem, or a duplicate file or part thereof within the same filesystem), re-using it if possible.

However, this means that whenever a filesystem is snapshotted, the entire filesystem is hashed - just to find out which bits of it have changed since the last snapshot. This takes time proportional to the total size of the filesystem, and lots of CPU for all that hashing.

The solution is simple - store a local cache of the top-level hashes of files, indexed on the path to the file, its mtime, and its size. When considering each file, if its path still maps to the same mtime and size, we can assume it's not been modified since the last snapshot (alas, there is a small race potential if the file was modified in a size-preserving way twice in the same second, with the previous snapshot falling between the two modifications, but that's livable with). In that case, just check that the cached hash still exists in the archive, and if it does, re-use it without having to hash the file at all. If the file has changed, or its hash is not found in the archive, hash it as usual.

This will reduce the cost of doing a snapshot from the cost of hashing the entire filesystem to about that of running a find(1) over the filesystem, which is quite small.
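Something like this sketch of the lookup, in Chicken Scheme - the cache is just an association list here, and hash-file and archive-has? are hypothetical stand-ins for the real hashing and archive operations, not actual Ugarit code:

    ;; Sketch only: each cache entry is a list (path mtime size hash).
    ;; hash-file and archive-has? are hypothetical stand-ins.
    (define (cached-file-hash cache path mtime size)
      (let ((entry (assoc path cache)))
        (if (and entry
                 (= mtime (list-ref entry 1))
                 (= size (list-ref entry 2))
                 (archive-has? (list-ref entry 3)))
            (list-ref entry 3)        ; unchanged and still in the archive: reuse
            (hash-file path))))       ; changed or unknown: hash it as usual

    (define (remember-file-hash cache path mtime size hash)
      (cons (list path mtime size hash) cache))  ; newer entries shadow older ones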

Remote archives

Currently, Ugarit has a number of "archive backends" that store the archive in different formats, but they all boil down to reading and writing files on the local disk. Clearly, it'd be nice to do backups to a remote archive.

What I'd like to do is to decouple the backends into separate executables that talk a simple binary "backend protocol" over standard input and output. Ugarit itself would just spawn a specified backend command as a child process and talk to it in order to access the archive. Better still, it would then also be easy to use ssh to securely run that command on a remote server, in order to access remote archives.
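To give a feel for it, here's a sketch of the client side using Chicken's posix unit to spawn the backend command; the command names and the s-expression request format are made up for illustration, not the actual protocol:

    (use posix)   ; for `process'

    ;; Sketch only: spawn a backend command and return a procedure that
    ;; sends one s-expression request and reads back one reply.
    (define (connect-backend command)
      (call-with-values
        (lambda () (process command))
        (lambda (from-backend to-backend pid)
          (lambda (request)
            (write request to-backend)
            (newline to-backend)
            (flush-output to-backend)
            (read from-backend)))))

    ;; Local archive:  (define be (connect-backend "ugarit-backend-fs /archive"))
    ;; Via ssh:        (define be (connect-backend "ssh backup@host ugarit-backend-fs /archive"))
    ;; Then, say:      (be '(exists? "1b7a3c9f"))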

I'd still like to write an Amazon S3 backend, and backends for other suitable protocols like HTTP GET/PUT and Venti, of course. But backends-over-ssh would remove much of the need for the sftp backend.

Clustered backends

Remote archives would be nice, but even remote archives can fail. It would be nice to have a cluster of remote archives. One could imagine mating cheap single board fanless computers to USB mass storage arrays and placing them all round your offices, to create a distributed archive, with redundancy.

It would be easy to write a Ugarit backend that proxied a cluster of physical backends to make this happen. It would have a configuration file listing a number of physical backends, each annotated with a trust percentage and a load weighting, the uses of which will become apparent.

The idea is that it stores every uploaded block on enough backends that the total trust percentage of all the backends used is at least 100%. So a single 100%-trusted backend would do, or two 50%-trusted backends, or ten 10%-trusted backends. But for every uploaded block, it would pick the backends randomly, with the randomness weighted by the load weightings of the backends: eg, with two backends with weightings 2 and 1 and 100% trust, the first one would receive two-thirds of the blocks, and the other one one third. The load weightings can be used to direct the flow of uploads to particular backends; a full backend can be given a load weighting of zero to prevent any more blocks from being uploaded to it.
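As a sketch of that placement rule (assuming positive integer weightings; random comes from Chicken's extras unit and remove from SRFI-1 - this is just the selection logic, not the real proxy):

    (use extras srfi-1)   ; random, remove

    ;; Each backend is a list: (name trust-percent load-weight).

    ;; Pick one backend at random, weighted by load weighting.
    (define (pick-weighted backends)
      (let loop ((bs backends)
                 (target (random (apply + (map caddr backends)))))
        (if (< target (caddr (car bs)))
            (car bs)
            (loop (cdr bs) (- target (caddr (car bs)))))))

    ;; Keep picking distinct backends until their trust adds up to 100%.
    (define (choose-backends-for-block backends)
      (let loop ((remaining backends) (chosen '()) (trust 0))
        (if (or (>= trust 100) (null? remaining))
            chosen
            (let ((b (pick-weighted remaining)))
              (loop (remove (lambda (x) (eq? x b)) remaining)
                    (cons b chosen)
                    (+ trust (cadr b)))))))

    ;; Example: (choose-backends-for-block '(("a" 50 2) ("b" 50 1) ("c" 100 1)))
    ;; might pick "a" and "b" for one block and just "c" for the next.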

The proxy would keep a cache of what blocks were sent to which backends, of course. So when asked for a block, it would consult the cache to find which backends have it, and pick one at random to read from. If that backend fails to provide it, then other backends from the cached list would be tried in turn.

If none of the backends referenced by the cache have it, or the cache knew nothing about that block, then the proxy would ask every backend in turn before giving up and reporting failure. Either way, whatever it finds out would be written into its cache (even failure, although there should be an option to flush failures from the cache if a crucial missing backend is restored after a temporary outage).
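A sketch of that read path, with backend-get as a hypothetical stand-in that returns the block data or #f, and the cache-updating and random choice among cached backends left out for brevity:

    (use srfi-1)   ; remove

    ;; Sketch only: each cache entry is a list whose car is the block key
    ;; and whose cdr is the backends known to hold it.
    (define (fetch-block cache all-backends key)
      (let* ((known (cond ((assoc key cache) => cdr)
                          (else '())))
             (candidates (append known
                                 (remove (lambda (b) (member b known))
                                         all-backends))))
        (let loop ((bs candidates))
          (cond ((null? bs) #f)                ; nobody has it: report failure
                ((backend-get (car bs) key))   ; found it: return the data
                (else (loop (cdr bs)))))))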

This would be quite awesome.

Backend transfer

It would be nice to be able to easily migrate an archive from one backend to another. This could be done by walking the chains of snapshots dangling from each tag in the source archive, transferring blocks to the destination archive. It might also be desirable to offer the ability to filter them, perhaps only transferring the chains from named tags, and perhaps only a specified number of snapshots back, to filter out old unwanted things.
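Conceptually it's just a walk over the chain, something like this sketch - every procedure named here (tag->latest-snapshot, snapshot-previous, snapshot-blocks, block-get, block-put!, archive-has?) is a hypothetical stand-in rather than a real Ugarit call:

    ;; Sketch only: copy the snapshot chain behind one tag from src to dst,
    ;; stopping after `limit' snapshots.
    (define (transfer-tag! src dst tag limit)
      (let loop ((snapshot (tag->latest-snapshot src tag)) (n 0))
        (when (and snapshot (< n limit))
          (for-each
            (lambda (key)
              (unless (archive-has? dst key)      ; only copy blocks dst lacks
                (block-put! dst key (block-get src key))))
            (snapshot-blocks src snapshot))
          (loop (snapshot-previous src snapshot) (+ n 1)))))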

This would be a useful management tool when combined with backend clustering, as it would then make it easy to rebalance the cluster when more redundancy is required - or less redundancy, or when replacing a backend.

I'm not yet quite sure what to do about tags in this system, though. Replicate them to all backends, for safety? How to deal with conflicts, then, as tags are mutable? Introduce timestamps?

FUSE plugin

Right now, Ugarit lets you read the contents of archives using a command-line interface, and then select things for extraction.

It would be nice to write a FUSE plugin interfacing to this, so you can just mount it as a read-only file system. The top levels of this file system would be the tags in the archive, and the dated snapshots beneath them; but these would then be the root directories of snapshotted filesystems, yours for the reading.
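The path mapping itself is simple enough; roughly (using string-split from Chicken's data-structures unit - just an illustration of the layout, not a FUSE implementation):

    (use data-structures)   ; string-split

    ;; Sketch only: "/mytag/2010-04-14/home/alaric/notes.txt"
    ;; => ("mytag" "2010-04-14" ("home" "alaric" "notes.txt"))
    (define (split-archive-path path)
      (let ((parts (string-split path "/")))
        (list (car parts)       ; tag
              (cadr parts)      ; dated snapshot beneath that tag
              (cddr parts))))   ; path within the snapshotted filesystem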

Archival mode

Right now Ugarit focusses on providing a backup service, by snapshotting file systems. However, I do have other plans for it: I also want it to work as an archival system. In particular, I want to transfer things like my MP3 collection, my digital photos and videos, my library of interesting PDFs, past projects I have completed, and other such archival material into it. Then I would actually access all this stuff via the FUSE interface, making Ugarit my primary storage for archival material rather than a place to put snapshots of it.

My proposal is to have two kinds of tags in archives - the existing ones that point to snapshot history chains, and ones that point to archive indexes. One can then tell Ugarit to "archive" a subdirectory of a filesystem into a given index, along with a set of key-value metadata describing it. The index tag actually points to a B-Tree index inside the archive which indexes all of the metadata keys; when the index is modified, copy-on-write semantics are used, so some old blocks are superseded by new blocks, whose parents are then superseded by new versions pointing to the new hashes of the children, all the way up to a new root. If the backend supports unlinking, then the superseded blocks may optionally be unlinked once the index update is complete.
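The copy-on-write step is the interesting bit; roughly something like this, with store-node! and node-replace-child as hypothetical stand-ins and all the B-Tree balancing details ignored:

    ;; Sketch only: rewrite the chain of index nodes from the changed
    ;; leaf's parent up to the root.  node-replace-child is assumed to
    ;; return a copy of a node with the relevant child pointer swapped
    ;; for the new child hash; store-node! writes a node and returns its hash.
    (define (cow-update archive ancestors new-leaf)
      (let loop ((nodes ancestors)
                 (child-hash (store-node! archive new-leaf)))
        (if (null? nodes)
            child-hash                       ; the hash of the new root
            (loop (cdr nodes)
                  (store-node! archive
                               (node-replace-child (car nodes) child-hash))))))

The index tag is then repointed at the hash of the new root.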

Having done that, the index tags would appear in the command line interface or the FUSE interface as directories with a rather different structure to those coming from snapshot tags: rather than containing all the dated snapshots plus a symlink called "latest", they would contain a subdirectory for each metadata key that has ever been used in the index (name, type, client, archived-on, etc).

Each of these then contains, for every archived directory tree using that metadata key, a directory named after the value of that key with a dot and a number appended; each such directory is the root directory of the archived tree itself. The number is appended since there's nothing stopping more than one archived subtree using the same value for that key (type being likely to be a popular one). However, if there is only one matching value, we also present a symlink from the un-numbered value to the numbered directory, so for unique keys we can just use them directly.

Also, each index tag would have a special top-level directory alongside the metadata key names, called search; one can construct paths of the form search/<key>/<value>/<key>/<value>/.../results, or interactively explore the directory tree to see what keys and values are available at each level (and peek in the results directory at each level to see a list of matching archived subtrees), to find subtrees matching all the <key>=<value> constraints in the path.

The root directories of archived subtrees would contain, as well as the actual archived files, a directory called .ugarit-metadata, which contains a file for each metadata key stored with the subtree, each containing that metadata key's value.

Having done that, if I mounted my archive onto /ugarit and archived my personal stuff onto an index tag called personal, then I could use paths such as:

  • /ugarit/personal/name/Pink Floyd: The Wall ...to find an archive of the MP3s of Pink Floyd's album "The Wall"
  • /ugarit/personal/artist/Pink Floyd.* ...to find all my Pink Floyd albums.
  • /ugarit/personal/search/type/project/client/FooCorp/results ...to find all the projects I did for "FooCorp".

This would be extraordinarily useful. Combined with backend clustering, it would also give me an awesome fault-tolerant replicated shared filesystem for my Stuff.

That's all, folks

There's a lot to do. I'll get there, but only slowly. Volunteers are most welcome 🙂

3 Comments

  • By Ben, Wed 14th Apr 2010 @ 8:47 am

    Have you considered using a userland (FUSE) implementation of ZFS or other snapshotting FS with storage onto remote servers? It would give you most of what you want.

    Make sure you can use FS snapshots, where supported. And later versions of ZFS can tell you what changed between snapshots. Making plugins for this kind of task would be quite useful.

  • By @ndy, Wed 14th Apr 2010 @ 10:55 am

    "But for every uploaded block, it would pick the backends randomly, with the randomness weighted by the load weightings of the backends:"

    This worries me from a restore point of view. Surely you want all the data needed to restore a single snapshot in at least one discrete place so that restores are easy: either because you only require one location to be reliably connected to the net at the time, or because you can just go to the remote location and fetch the disk.

  • By alaric, Tue 20th Apr 2010 @ 8:29 am

    Ben: If I rewrote the storage management layer in ZFS so it worked in append-only archival mode, and rewrote the file hierarchy management layer so it supported indexed archives, then it might be useful - but what would be left?!?

    @ndy: This scheme will happily let you fully replicate, by just making sure the sum of the trust levels of all your backends is 100% - so the only way to get full trust is to put every block everywhere. And what I think you're proposing - picking one backend as the target for each snapshot - won't work unless you also restrict reads to just that one backend for the duration of the snapshot - or the snapshot sent to that backend will refer to 'shared' blocks on other backends, so you won't be able to restore from just that backend. If you want that kind of semantics, just have a lot of backends and pick a different one for each snapshot; Ugarit doesn't need a clustered backend manager for that! The idea here is to manage a pool of storage, so you can just add extra backends in with more disk space when you start to run low, while having redundancy due to replication.

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales