Folding history (by )

Ugarit is a content-addressed store; the vault is a series of blocks, identified by a hash, that cannot change once they are written.

But logically, they appear as a set of "tags", each of which either points to an archive (a set of files with associated metadata, which can be added to, or the metadata of existing files changed) or snapshots (a chain of snapshots of a filesystem at a point in time).

So in a store where objects cannot be modified, how do we create the illusion of mutable state in these "tags"? Well, in the physical vault storage, tags are named pointers to blocks; the vault consists of a series of blocks, identified by their hashes, and a series of tags, identified by names and pointing to a block; and the tags can be updated to point to a new block.

Snapshot tags are the easiest to explain. Every snapshot is represented by a block, which contains snapshot metadata and a pointer to the block representing the root of the directory tree in the filesystem snapshot. But a snapshot block also contains a pointer to the previous snapshot block. The snapshot tag just points to the most recent snapshot block - so the entire chain can be found by following the pointers from each block to its predecessor, until you reach the first snapshot, which has no pointer to a previous snapshot.

Archive tags are a little more complicated. An archive is built up in a series of imports, each of which assigns metadata to a bunch of files. So we store a sequence of archive blocks, each of which represents a single import - and which also has a pointer to the previous import's archive block. The list of objects in the archive can be found by walking the chain and adding up all the objects, and keeping the metadata from the most recent import, as later imports can override the metadata in earlier imports.

This means that the entire history of an archive is preserved; we can look back and see what it was like at previous points in time, by just starting from some previous import rather than the most recent.

As such, both archive and snapshot chains are known as "histories" in the Ugarit internals, even though the contents of them differ; they are both a sequence of objects, each with a creation time and a pointer to a previous version that has an earlier creation time. In order to present that sequence to the user, we provide an internal interface to "fold over the history", with the name "fold" being functional programming jargon for a way of traversing linear sequences.

But we don't always just have a linear sequence of things in histories, and there's where it gets interesting. Snapshot tags can diverge; if a server is cloned, for example, then the two resulting instances of it can have their own diverging histories that happen to share the same origins. This is done by just duplicating the tag - creating an extra tag pointing to the same snapshot block as another tag - and then doing snapshots onto both tags; the resulting chains of snapshots hanging off of the two tags, at some point, converge on the "common ancestor" snapshot and are the same thereafter.

This could happen with archives, too; one might decide to, in effect, take a copy of an archive, by duplicating the tag, and then those two copies can have a different history from that point onwards. This can be useful to record a point in history, by making a duplicate tag that you never update, but can go back to to see the archive as it was then. Or I might give a copy of my archive to somebody, for them to re-organise to suit their tastes. And so on.

But this doesn't complicate the software much; I need to provide an operation to duplicate a tag, but once that's done, each tag still points to a simple linear chain of snapshots or imports that goes back to an origin.

However, recently, I've added a new facility to Ugarit that complicates matters somewhat; I've made it so, as well as "forking" a tag by duplicating it, you can also "merge" two (or more) tags into one.

This is useful for a few reasons. For archives, it makes a lot of sense - my wife and I might have our own music archives, and decide to merge them into one large shared one, for instance. And even for snapshots, it makes some sense; we might clone a server, as discussed above, and thus have represented the diverging filesystems by forking the tag. But we might decide we don't need the increased capacity of two servers, and decide to merge them back again, using rsync to ensure that all the changes from one are applied back to the other, and then shutting it down. We can represent this in the snapshot history of the resulting server by merging the two snapshot chains back into one again.

But either kind of history might have unwanted forks in them, caused by network partitions. I plan to create a replicated vault system, which allows a vault to exist across a number of physical storage servers - with resilience, so the vault still works in the presence of network partitions and server failures. In such a case, two different users might import onto the same archive tag, but while on opposite sides of a network fault, causing the archive tag to "diverge" into two. The replication system will notice this when the network heals, and represent it by forking the tag, so both diverged states are preserved. The fix will be to then merge it back again.

Under the hood, this is simple - we just create a snapshot or archive block that has multiple previous pointers. For an archive, the simple "closest import to the current state" rule for finding which metadata for a file is "current" is made a little more complex; when we have a merge, we need to list the tags we are merging in a priority order, and updates from higher-priority inputs override any from lower-priority inputs when the same object exists in both.

But it's a little trickier with snapshot tags, which are viewed purely as a linear sequence of timestamped snapshots. And, similarly, an archive tag can also be viewed as a "history", a sequence of timestamped imports, in exactly the same way (mainly for auditing purposes). Given that the history of a tag might now be a complex network of forks and merges, not unlike a river that splits and joins as multiple streams join into a raging torrent that spills into the nearest ocean, how do we turn that into a linear sequence in timestamp order? How do we represent that complexity to the user?

Well, as it happens, I had an idea.

The obvious simple option is to just take all the imports or snapshots in the entire history, and sort them by timestamp, and spit them out in order. That at least gives you all the history entries in order, but it hides the forking and merging. Also, there might be thousands of them (I have nightly snapshots of one of my servers going back years, which is currently a history eight hundred and thirty-six snapshots long) that need to be loaded into memory and sorted, just to find the most recent few.

So, I came up with a better way. At heart, it's a merge algorithm - because although any history chain may split into multiple ones as we go back in time through merges, each chain is still in order already, so we just need to "merge" them in order to produce a sorted output. As we walk through the history, we keep a list of current "threads" we are examining, which starts off as just one (pointing to the single entry pointed to by the tag). At each step, we look at the block at the head of each thread, and pick the most recent one, and output that; we then see how many previous entries it points to (be there zero of them because it was an original entry, one of them for a traditional single-parent entry, or more than one because it was a merge). If there are no parents, we delete the thread; if there is one parent, we move the thread along to that block; and if there are more than one parent, we need to create additional threads for them, replacing the original thread.

So how do we detect when there was a fork? As written, if two snapshots have the same parent pointer, then our two threads will both continue into the common parent chain and duplicate the snapshots thereon. We need a way to detect when two threads have converged into a common chain, which we do by noticing when we now have two (or more) threads pointing to the same block after we've performed the previous step. When that happens, we remove them from the thread list, and replace them with a single parent thread.

So the result of the above algorithm is a nice sorted list of history entries, but without needing to load them all into memory and sort them, which is nice and efficient. But it still doesn't help the user by letting them know when forks and merges happened.

But we detected forks and merges as we followed the history - so we can reflect them by giving numbers to the threads, and as we output entries, stick a suffix on the entry name (its timestamp) indicating which thread it came from. If there's only one thread, we don't bother, as there's nothing to distinguish. And rather than using the arbitrary index of the thread in the list (which will change for a thread as other threads come and go, and be re-used by other threads later), whenever we create a new thread, we'll give it a new sequential number and use that as a suffix. That way, the user looking at a snapshot history will see when it diverged and converged, giving them some idea of what went on.

Except the algorithm as given will sometimes give ugly results; if we allocate suffixes when we create threads, but show them to the user in timestamp order, they'll sometimes appear in funny orders (which will often depend on what order the previous entries of an entry were listed, which can be arbitrary in the case of snapshot chains). Instead, we can be lazy about actually allocating suffixes - and only give a thread a suffix when it is first chosen to supply the next entry in the history. Only at that point do we need to display the suffix to the user, so if we allocate them then, whenever the user sees a new suffix, it will be the number above the previous suffix they saw, which is nice and neat.

Anyway, implementing that in Scheme was quite fun; as usual, a fairly complex algorithm wasn't many lines of code, and has been written with very few bugs cropping up. I wrote it as a generic algorithm that accepts an abstract interface to the underlying storage, so it can be applied as-is to either snapshot or archive histories, or to synthetic histories used in the unit test suite. See the code, and take a look at the unit tests to see how they exercise it, and see some examples of its work!

No Comments

No comments yet.

RSS feed for comments on this post.

Leave a comment

WordPress Themes

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales
Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales