fold>cons (by )

Or should I say, (> fold cons)

In a way, I'm sad that I started my life programming imperatively rather than functionally. While I feel that imperative programming is still the most appropriate for some problems (particularly ones that really do look like state machines), and logic programming is great for dealing with big blobs of knowledge (such as databases), most programming tasks, I think, are best handled functionally.

But imperative programming is in fashion; it's what people are taught because it's what people use, and it's what people use because it's what they were taught, and lots of feedback loops maintain that status quo; there are more tools for imperative programming, and more books, and so on. Lots of people think functional programming is somehow mind-bending and difficult, but that's just because they don't know it well, and the first taste of something unusual often seems bizarre and worrying.

So, although I find functional programming more efficient and pleasant than imperative, I don't get to do very much of it. And, therefore, I'm really still learning it.

While working on Ugarit, in particular, I learnt a new trick. And that trick was the true usefulness of fold.

fold is a function in Scheme and similar languages that applies a supplied function to every element in a list; but rather than collecting the results into a new list, as map does, the result of each function call is just passed into the next one. To get the process started, you have to supply an initial value to pass into the first call.

Here's what it'd look like in C, assuming you were dealing with an array of ints:

/* Fold kons over an array of ints, starting from knil:
   returns kons(int_array[n-1], ... kons(int_array[1], kons(int_array[0], knil)) ...). */
int fold(int *int_array, size_t array_len, int (*kons)(int element, int acc), int knil) {
  int acc = knil;
  while (array_len--) {
    acc = kons(*int_array++, acc);  /* feed each element, plus the running result, to kons */
  }
  return acc;
}

The names kons and knil are hints as to what they do; hints that I should have noticed, but didn't. But more on that later.

Now, you might use that to add up the elements of an array, by making a function that adds its two integer arguments and returns the result, and passing it to fold with 0 as knil. Or multiply them all together. Or something.

Now, I rarely have to add up the elements of lists, so when I read about fold in the Scheme specification, I thought "Meh, I'll remember that when I next need to add up the elements of a list", and mentally discarded it into the pile of barely-useful tools.
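
(For the record, that sort of use is a one-liner with the list fold:)

(fold + 0 '(1 2 3 4))   ; => 10
(fold * 1 '(1 2 3 4))   ; => 24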

My mistake was that I was thinking of fold as a function for applying a function to every element of a list. Which is rather the wrong way round. fold is really a mental model of a sequence, of which the list function fold is just one implementation. It should really be called list-fold.

What made me realise this was reading the interface to Alex Shinn's gdbm egg. It's a wrapper to the gdbm library, which provides file-based persistent key-value mappings.

One of the operations the gdbm library gives you is to iterate over the whole database. There's a gdbm_firstkey function to get the first key, and a gdbm_nextkey function to get the next one. Call gdbm_firstkey, then keep calling gdbm_nextkey until you stop getting keys. That's an imperative interface, as it involves having a "cursor" that each call to gdbm_firstkey or gdbm_nextkey moves around; it's all about altering state.

But in the gdbm egg, Alex provided a more functional way of doing that. He provided gdbm-fold, which accepts a gdbm database handle, a kons function to call on each record in turn (along with the return value of the last call to kons), and a knil value to get the process started. It returns the result of the final kons call.

Now, the interesting thing is that if you just want a list of records, you can pass in an empty list for knil, and for kons, a function that takes a record and a list and returns a new list starting with that record and then continuing with the list passed to it (i.e. taking that list and sticking the new record on the head of it). Indeed, were it not for the fact that gdbm-fold calls kons with three arguments - the key of the record, the contents of the record, and the result of the previous kons - you could pass in the standard Scheme function cons as kons, and the empty list '() as knil. Which is where the names kons and knil come from - '() is pronounced "nil", and cons is an acceptable kons function.
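
Collecting every record into an association list would then look something like this (assuming the handle, kons, knil argument order described above, and a database handle called db):

(gdbm-fold db
   (lambda (key value acc) (cons (cons key value) acc))
   '())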

Indeed, you can use traditional fold to copy a list, by passing in cons and '(). It'll come out in reverse order, mind, since fold starts at the beginning of the list: the first cons joins the first element onto '() to get a one-element list, the second cons joins the second element on top of that to give a list with the second element then the first, and so on, until you end up with the entire list reversed.
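
That is:

(fold cons '() '(1 2 3))   ; => (3 2 1)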

But this is the interesting thing about fold. Folding lists isn't actually all that interesting. What's interesting is that anything list-like can have a fold operation.

See, when working on Ugarit, I came across various sequential structures that live within archives. When a file is chopped into blocks and stored, the entire file is referred to by a reference to a list of references to all the blocks, that lives in the archive. And a directory is a list of directory entries. And a tag refers to a chain of snapshots going back through time.

When I wrote the first of these structures - the list of references to blocks, which I called a key stream - I needed a function to get a key stream back out of the archive so the blocks of a file could be brought out in order and written back to disk in an extraction operation.

My first instinct, of course, was to write a function that would recurse over the tree structure that's used to store a key stream within the archive (because if a key stream is too large to fit in a block, it needs to be split into blocks, and then another higher-level, smaller key stream used to collect that list of blocks...), building up a list in memory which it would return. The problem is, each of the blocks that represent a file holds a megabyte of data; a file that's a terabyte long would involve a million keys, and building up a list in memory to hold them all struck me as fundamentally wrong. gdbm-fold, after all, lets you iterate over the entire contents of a database without loading all of it into memory.

So I wrote fold-key-stream, a function that takes a reference to a key stream, a kons function that is called on every key in turn, and a knil value to get the ball rolling. Having seen gdbm-fold, I had a hunch this was the right approach. I expected to have to perform a mind-bending turning inside out of my simple tree walk, but when I sat down and did it, the right answer fell out easily:

;; kons is called on (key type accumulator) for every key in the stream, in order
(define (fold-key-stream archive key ks-type kons knil)
   (define type (archive-exists? archive key))
   (if (eq? ks-type type)
      ; Key-stream block: recurse over its subkeys
      (let ((subkeys (deserialise-key-stream (archive-get archive key))))
         (fold
            (lambda (subkey acc) (fold-key-stream archive subkey ks-type kons acc))
            knil
            subkeys))
      ; Leaf node: just hand it to kons
      (kons key type knil)))

The function asks the archive for the type of a block, given that block's key, to see whether it's followed the tree all the way down to the leaves or not: the key-stream blocks will all have a particular type (ks-type), while the leaves - the actual blocks the key stream points to, such as file data - will have some different type. So you can call fold-key-stream on a block that's not even a key-stream block, and it'll treat it as a key stream with just one element in it: that block.

So the code starts by checking the type of the block. If it's not a key-stream block, it's a leaf block, so we just call kons on it, passing in knil since this is the only call to kons, and return the result. Job done.

But if it's not a leaf node, then we go into the recursive branch. The first thing we do is to read the key-stream block and convert it into a list of keys, which we call subkeys. And then we get funny, by calling that useless fold function for summing up lists on it. Now, we have a list of keys, which may be actual keys to call kons on, or keys of further subtrees. But we already know how to handle both those cases: if it's a leaf node, fold-key-stream will just call kons on it, and if it's the key of a further subtree, then that subtree is actually a smaller key stream in itself, and fold-key-stream can deal with it... so we just wrap fold-key-stream in a little wrapper function that adds in the extra parameters, such as the reference to the archive, and use fold to call it on every subkey in the block. fold provides the basic plumbing for us, of actually calling the function and passing in the accumulator parameter.

The clever part is that each call to fold-key-stream is given the result of the previous call as its knil; we're recursing folds within folds to walk down a multi-way tree, but the chain of passing accumulated results into kons is preserved.
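
So, for example, an extraction can visit every data block of a file, in order, without ever building the whole list of keys in memory. Something like this (file-key, key-stream-type and write-block-to-disk! are stand-ins for illustration, not the real extraction code; the accumulator here is just a count of blocks written):

(fold-key-stream archive file-key key-stream-type
   (lambda (key type count)
      (write-block-to-disk! (archive-get archive key))
      (+ count 1))
   0)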

So really fold is more like a functional foreach operation: the fun isn't in calling it on lists, but in providing fold operations for all sorts of sequential data structures. Since then I've written several fold operations for various data structures within the archive. Indeed, one of them is a high-level ls operation on nodes within the archive, from the root that lists all the tags in the archive, to each tag that has a list of snapshots, to the directories and files within the snapshots. I will use this with a kons function that prints out each entry in turn, pausing for a keypress after every twenty or so, and returning to a stored continuation (sort of like a longjmp in C, or throwing an exception) to break out of the fold if the user presses 'q'.

In practice, any function that builds up a list and returns it probably ought to be recast as a folder; this can be done simply by letting the user provide their own cons and '() (let's call them the kons and the knil) rather than using cons and '() to build a list up yourself. Then the user can get a normal list if they want, or can process the elements directly as they arrive, rather than getting a list from your function that they immediately tear apart again.
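
Here's a tiny, hypothetical example of that shift (iota-list and iota-fold are names invented just for this illustration):

;; Builds and returns a list itself: (iota-list 5) => (0 1 2 3 4)
(define (iota-list n)
   (let loop ((i (- n 1)) (acc '()))
      (if (< i 0) acc (loop (- i 1) (cons i acc)))))

;; The same thing recast as a folder: the caller supplies the kons and knil.
(define (iota-fold n kons knil)
   (let loop ((i 0) (acc knil))
      (if (= i n) acc (loop (+ i 1) (kons i acc)))))

;; (iota-fold 5 cons '())  => (4 3 2 1 0)   ; a list, if that's what you want
;; (iota-fold 5 + 0)       => 10            ; or no list at all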

So, a fold function is, in effect, a way of representing a list. Or, at least, a potential list, that doesn't exist until you invoke it. Which leads one to think of defining list operations like map over them...
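
For instance, given any folder that follows the (folder kons knil) calling convention, a map falls out almost for free; folder-map here is just a sketch, and the results come back reversed, exactly as with fold and cons:

(define (folder-map f folder)
   (folder (lambda (element acc) (cons (f element) acc)) '()))

;; (folder-map (lambda (x) (* x x))
;;             (lambda (kons knil) (iota-fold 5 kons knil)))
;;   => (16 9 4 1 0)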

Ugarit (by )

The core of Ugarit is ready. Now I just need to wrap some UI around it...

However, the code I have will now archive and restore directory trees!

Take a look. Here's the input:

    -bash-3.2$ ls -l test-data
    total 4
    prw-r--r--  1 alaric  users     0 Jan 15 20:52 FIFO
    lrwxr-xr-x  1 alaric  users    18 Jan 15 03:02 LICENCE.txt -> subdir/LICENCE.txt
    -rw-r--r--  1 alaric  users     0 Jan 15 02:44 README.txt
    brw-r--r--  1 alaric  users  0, 0 Jan 15 20:52 blockdev
    crw-r--r--  1 alaric  users  0, 0 Jan 15 20:53 chardev
    drwxr-xr-x  3 alaric  users   512 Jan 15 10:01 subdir
    -bash-3.2$ ls -l test-data/subdir/
    total 4
    -rw-r--r--  1 alaric  users  1527 Jan 15 02:44 LICENCE.txt

...and here's the output. Note that the file modification times are preserved, and that the FIFO, the device special files, the symlink, the zero-length file and the subdirectory all make it across, too. The modtime of the symlink isn't preserved, but that's because there doesn't seem to be a POSIX function to do it - I use utime to set the times; NetBSD seems to have a utimes/lutimes pair that can do symlinks properly, but I don't know how widespread that is.

It does the mode, uid and gid, too; note that the files come out owned by alaric, even though I ran the extract as root.

    -bash-3.2$ ls -l tmp3
    total 4
    prw-r--r--  1 alaric  users     0 Jan 15 20:52 FIFO
    lrwxr-xr-x  1 alaric  users    18 Jan 17 01:03 LICENCE.txt -> subdir/LICENCE.txt
    -rw-r--r--  1 alaric  users     0 Jan 15 02:44 README.txt
    brw-r--r--  1 alaric  users  0, 0 Jan 15 20:52 blockdev
    crw-r--r--  1 alaric  users  0, 0 Jan 15 20:53 chardev
    drwxr-xr-x  3 alaric  users   512 Jan 15 10:01 subdir
    -bash-3.2$ ls -l tmp3/subdir/
    total 4
    -rw-r--r--  1 alaric  users  1527 Jan 15 02:44 LICENCE.txt

Now, give me a moment while I wrap a command-line shell around it and some higher-level management stuff, then run a more serious test, and it'll be ready for initial release...

Deaf, dumb and blind (by )

The scene: I'm happily working away in London. Sarah is at home, and finds that the Internet connection is down. So she resets the wifi AP and the router and so on, but it doesn't get better. So she rings me.

I can't ping the router from afar, and a traceroute ends at an address that, from memory, doesn't look like the last hop before the ADSL link, so I file a ticket with the ISP and get on with work.

But there's no reply to the ticket, so I ring, and sit in a queue, which eventually hangs me up.

So I google and find:

186k, Elite Internet, Mailbox Internet Services Go Bust?

Ah. Apparently my broadband ISP has become one with the Force. So Sarah can't get any work done at home, and I'll have to figure something out so that I can get some work done when I go home, and we'll need to join the queue for a MAC code to migrate to a new broadband provider, which we'll have to choose. And I doubt we'll get back any of the money - we paid a year in advance...

Bleargh...

Ugarit update (by )

Jean woke us up at 6am today, wanting to be in Mummy and Daddy's Bed, but I couldn't get back to sleep after this so lay there thinking about things.

I was pondering the cache-deletion issue in Ugarit that I mentioned before.

The cache in Ugarit serves an important function: in order to know whether it needs to upload a block of data or not, Ugarit first checks to see if a block with the same Tiger hash already exists. This check is performed on everything in your entire filesystem every time you do a snapshot (although I'm scheming up some optimisations to avoid this, based on caching the hashes of files by their modtime and all that). If you have a slow remote archive, such as one in S3, then each check for the existence of a block might easily take a tenth of a second or more - and since the largest filesystem I need backing up will contain at least a million blocks, that'd be over a day spent just checking what blocks already exist.

So the cache contains a list of known-existing blocks, stored in a B-Tree on the local filesystem. Whenever a block is uploaded, it's added to the B-Tree, as we know it exists in the archive. When a check for the existence of a block comes in, we check the cache; if we find it there, then we know instantly that the block exists. Otherwise we check the archive, and if the block is found, we record that fact in the cache so that we can use it in future. And, of course, if we are asked to unlink a block, we pass that request to the archive, and if it reports that it actually deleted the block rather than just decrementing its reference count, we remove the block's hash from the cache.
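
The shape of that existence check is roughly this (a sketch only, not Ugarit's actual code; cache-known? and cache-add! are stand-in names for the B-Tree lookup and insert):

(define (cached-exists? cache archive key)
   (if (cache-known? cache key)
      #t                                      ; cache hit: no round-trip needed
      (let ((type (archive-exists? archive key)))
         (when type (cache-add! cache key))   ; remember positive answers
         type)))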

However, if you have multiple computers sharing a single archive - which is a perfectly sensible thing to do, as the shared archive means that all the different filesystems being backed up to it share copies (and thus don't waste upload bandwidth) of files that appear in more than one of them (such as most of everything outside of /var, /home, and */etc on a NetBSD system) - then deletion with caches poses an issue: if you delete a file and update your local cache, but somebody else is also using a cache on the same archive, then their cache will not know about the deletion. This is dangerous, since that cache will then falsely report the existence of a block that's not actually in the archive, which means the contents of that block won't be uploaded - and since it was deleted from the archive, that block won't be backed up anywhere. Danger, danger!

But as I lay there thinking, a solution came to me.

I should make the cache backend maintain an intent log of deletions it proposes to make. When this log is about to become too big to fit in a block itself, it should:

  1. Upload it to the storage backend
  2. Request the key of the block pointed at by a special **DELETION CHAIN HEAD** tag
  3. Upload a small block containing that key, and the key of the block full of deleted block keys.
  4. Update the **DELETION CHAIN HEAD** tag to point to the new block
  5. Process the actual deletions

That way, it keeps a log of deleted references in the archive, and makes sure it's initialised before doing any actual deletes.
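
In rough code, that flush might look something like this (very much a sketch; storage-put!, storage-get-tag, storage-set-tag!, storage-delete! and serialise are all hypothetical stand-ins for whatever the real backend API ends up looking like):

(define (flush-deletion-log! storage pending-keys)
   (let* ((log-key  (storage-put! storage (serialise pending-keys)))
          (old-head (storage-get-tag storage "DELETION CHAIN HEAD"))
          (new-head (storage-put! storage (serialise (list old-head log-key)))))
      (storage-set-tag! storage "DELETION CHAIN HEAD" new-head)
      (for-each (lambda (key) (storage-delete! storage key)) pending-keys)))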

Then the cache just needs to store a local copy of the **DELETION CHAIN HEAD** tag. When it starts up (or wants to do a periodic check) it can fetch the one from the archive; if they differ, then it should follow the chain of deletion blocks, removing the keys they list from the cache, until it reaches the deletion block whose key it has stored, or the end of the list, at which point it can stop and update its local copy of the tag.

There are still potential races - another deletion operation running in parallel might race over the **DELETION CHAIN HEAD** tag, although I've tried to keep only a very small block upload within the window between getting and setting the tag - so I've added tag-lock! and tag-unlock! primitives to the storage engine API, to avoid that entirely.

More excitingly, if a deletion is running in parallel with a snapshot, then the cache being used by the snapshot might not realise a block has been deleted, and will happily report that it still exists.

Perhaps I need to extend tag-lock! and tag-unlock! to be a full-fledged semaphore system, so I can maintain archive invariants, such as that there may be N snapshots running or 1 deletion, like a read/write lock. But I don't like locks - doing it without locks would be much better!

Currently, the archive storage engine API won't quite allow the intent log, anyway, since it just has an unlink! function that returns false if the block still has references, and returns the contents of the block if it does not (so that the caller can then recursively unlink! everything mentioned from within that block). So there's no easy way of asking the storage engine we're wrapping the cache around whether an unlink! would delete a block without actually deleting it. But, we can make do without, at the cost of a little less safety; we can instead make the cache store an after-the-fact log of deleted block keys, and just upload it when it gets full.
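
(For reference, that recursive unlink has roughly this shape - a sketch only, assuming unlink! takes the storage handle and a key, and with extract-keys standing in for whatever parses block references out of a block's contents:)

(define (unlink-tree! storage key)
   (let ((contents (unlink! storage key)))
      (when contents                         ; #f means other references remain
         (for-each (lambda (k) (unlink-tree! storage k))
                   (extract-keys contents)))))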

So, I'm still not sure if we need the complexity of safe deletions. Are frequent deletions actually a good idea anyway? The neat thing about a content-addressed store is that it does work well as a continual archive, as it essentially stores differences. I decided to implement deletion since I know there will be situations where the thought of freeing up a hundred gigabytes will be more inviting than having a year's snapshots from a decade ago lying around; if I don't implement deletion, then users will forever be pestering me about it. So perhaps I should just leave deletion in as is, along with a warning (automatically output when a cache spots its first deletion request of a session) to the user that doing a deletion will invalidate any other caches around the same archive on different machines.

Still, one cheery thought struck me: if you're running a snapshot, and your computer crashes, then you can just start the snapshot again. We'll only update the snapshot tag when a snapshot is completed, so you won't get a partial snapshot; but when you start again, all the blocks you'd already uploaded will not need to be uploaded again, so it'll just pick up about where it left off last time. Excellent!

Backup system progress update (by )

I forgot to mention before... I chose a name for it. Ugarit, after the site of the oldest human archive yet found.

Anyway, I had another few hours at it last night (I think my total's up to about a working day now; one more working day to go before I know if my two day estimate was correct!).

I've implemented two fundamental abstractions. In a limited-block-size, reference-counted, content-addressed store, a large file has to be stored as lots of chunks, with the keys of the chunks stored in one or more indirect chunks full of keys. If there are too many keys to fit in one indirect chunk, then you need lots of indirect chunks, and in turn one or more doubly-indirect chunks to store the keys of THEM in, until you finally have a single chunk at the top of a big tree, whose key you can use to refer to the file.

Reference counting complicates this. As you're writing the file's chunks, some may already exist in the archive, and some might not. Those that already exist, you need to up their reference count, so we know they're being used. Those that don't, you upload the actual content to create them.

But if you reuse all of the blocks of a file - because it's not changed - then you'll find you can reuse the indirect blocks, too, since they won't have changed. In which case you mustn't increment the refcounts of the data blocks, since they're still only referenced from the one indirect block, even though that indirect block now has two references.

So, what I've done is to implement a "key stream writer" API. You create a key stream writer and feed keys into it, but along with every key, you feed a 'reused?' flag which is true if the key is a reused one rather than a new one.

At the end, you finish the key stream writer, and it returns you a key for the entire tree, plus a reused? flag.

And it does all the magic internally, incrementing the refcounts on reused keys passed to it if and only if it creates a brand new indirect block referring to that key, and correctly handling refcounts on any indirect blocks it uses. It does this with a lovely recursion; the key stream writer accepts keys into a buffer and flushes the buffer out to a key stream indirect block when it gets full, at which point it creates a parent key stream writer to manage the next "level" up in the tree, writing the key of the newly created indirect block to that writer. I'm quite taken aback by how simple and elegant the code is, for what it does 🙂

Then, on top of that, I implemented an sexpr-stream writer, which accepts sexprs (Lisp data records, in effect) and buffers them, writing them to the archive when there's too much for a single block, and using a key stream writer to track the blocks written. This is to be used to implement directories, which will be a list of records giving details of all the files in the directory. The sexpr stream writer needs to know about keys mentioned in the sexprs, so it can do the same refcount-tracking magic to handle sharing correctly. Luckily, in spite of those requirements, it also works beautifully and is also elegant code!

Then I just need to implement storing files (which should be quite easy; split the file into chunks, check to see if each chunk is already in the archive and upload if not, write keys to key stream), then I can implement storing directories (recursively store files and directories, use an sexpr stream writer to log each file created). Directory storage will need a few cases handled, since I plan to store funny filesystem objects like FIFOs, device specials, symlinks, and so on.

Then I'll need the main driver that does a snapshot of a directory tree, then creates a snapshot object with metadata in and links it into the snapshot chain for a chosen tag.

I reckon that lot will take me half of the next day - then I'll implement deletion (I've been implementing primitive deletion operators for key streams and sexpr streams as I go along, so this won't take long) and extraction (again, I've written fold operators for key and sexpr streams, so this won't take long), then I'll have the core done!

What will remain then? A command-line wrapper with decent option parsing and command line ops to do a snapshot to a tag, retrieve all or part of a snapshot, list snapshots on a tag, delete a snapshot, and delete a tag, which should just be a matter of plumbing. Then I'll be done!

I still think I'm on for two days... but mainly because I'm punting on writing an actual S3 backend for now. That'll take a while, dealing with fiddly Web Services authentication stuff!

Other things I ought to get around to doing at some point are:

  1. LZMA compression. For now, I'm using a deflate implementation for Chicken Scheme that I found, but I would like to wrap liblzma as a Chicken egg, since it's thought to have tighter compression.
  2. Encryption. Right now I have a stub in the code for encrypt/decrypt operations, applied to the compressed blocks, but for now it just returns the block unchanged. I need to either use Chicken's libcrypt egg (which looks complicated) or find a lightweight implementation of a good cipher that I can wrap as an egg.
  3. More storage backends.
    1. A replication backend that wraps a list of backends, doing writes and deletes to all of them, and servicing reads from the first backend that succeeds to provide a block. That way I could back up to a local disk and to S3, and restores would come from the local disk if possible or from S3 if not. Even if I lost part of the local disk in a filesystem corruption incident, anything I lost would still be available from S3.
    2. A generational backend, that wraps a list of backends. Writes are only done to the first, but reads (and deletes) are serviced from the first one that responds. That way, I can fill up one hard disk with archives, then start filling another, and so on, building up a chain of archive partitions.

Care must be taken when doing complex things with backends, though, in order to preserve the correctness of reference counts. A group of archives that are used together by replication or generations will have refcounts that are correct across the whole group, not necessarily individually within each component archive.

More forebodingly, my existing cache backend wrapper won't mix well with deletion of snapshots when the same backend is shared between multiple machines, as it basically caches the existence of blocks; but if a block is deleted by another system, the local cache won't know, and will incorrectly report that the block is already known. Which means it won't get uploaded to the archive, which means data could be lost.

For now I'll recommend that people who want to delete from shared repositories don't use the cache, or flush the caches after a deletion, but in the long run I'd like to think about how to improve the cache. Perhaps by logging deleted keys to a special area in the backend, and when a cache starts up, it can fetch the log and look for any deletions it doesn't already have. Or just compress the entire cache file and upload it when closing the connection, and download it when starting a connection, as long as no two cache-using machines are snapshotting at the same time...
