Log-structured file systems revisited

However, we can do more lovely things with this file system. We can make it seamlessly accept extra disks, to give it more space. Here's how.

Imagine we use 64-bit physical block numbers, and take the top byte to be a device number. We have no requirement that physical block numbers be contiguous - just that every block number has a successor, so we can advance the clean and allocation pointers - and that block numbers can be numerically compared to see whether a block is in the cleaning zone. What this means is that the 'next' operation on these pointers can, if required, make large jumps.

So imagine we have three disks, numbered 0, 1, and 5 (why the jump? Because we used to have disks 2, 3, and 4 once but lost them; the disk numbers need not be contiguous).

If disk 0 has 1,000,000 blocks, and disk 1 has 5,000,000, and disk 5 has 100,000 blocks, then the next block after [0/999,999] would be [1/0]. The next block after [1/4,999,999] would be [5/0]. And the next after [5/99,999] would be [0/0].
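
To make that 'next' operation concrete, here's a rough sketch in Python, using the three example disks above. The packing of device and block number into a single 64-bit value, and all of the names, are illustrative assumptions rather than a definitive layout.

    # Disk numbers and sizes from the example above; the device number
    # lives in the top byte of a 64-bit block pointer.
    DISK_SIZES = {0: 1_000_000, 1: 5_000_000, 5: 100_000}  # blocks per device

    def pack(device, block):
        """Combine a device number and a per-device block number."""
        return (device << 56) | block

    def unpack(ptr):
        return ptr >> 56, ptr & ((1 << 56) - 1)

    def next_block(ptr):
        """Advance a pointer by one block, jumping to the next device
        (in numeric order, wrapping around) when the current one is full."""
        device, block = unpack(ptr)
        if block + 1 < DISK_SIZES[device]:
            return pack(device, block + 1)
        devices = sorted(DISK_SIZES)                       # e.g. [0, 1, 5]
        nxt = devices[(devices.index(device) + 1) % len(devices)]
        return pack(nxt, 0)

    assert next_block(pack(0, 999_999)) == pack(1, 0)
    assert next_block(pack(1, 4_999_999)) == pack(5, 0)
    assert next_block(pack(5, 99_999)) == pack(0, 0)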

We could then store a copy of the superblock on every disk - perhaps in block 0, and have the "first" allocatable block of each disk instead be number 1.

New disks can be added at any point, as long as the existing disk numbers are not affected. To remove a disk, the system will have to do something similar to a clean, and walk the tree to find every live block stored on that disk and relocate it. The disk will have been taken out of the "allocation loop"; the allocation pointer, as the relocated blocks are being written, will skip over that disk if it comes to it. When the tree walk is complete, the disk is now guaranteed free of useful information, and can be removed from the list, freeing up its index number.
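
A rough sketch of that removal pass, assuming the system provides a full-index walk and a relocation primitive (both hypothetical names here):

    def remove_disk(fs, device):
        """Evacuate every live block from `device`, as described above.
        `fs.live_blocks()` (a full tree walk) and `fs.relocate()` are
        hypothetical primitives; relocation writes go to the allocation
        pointer, which has already been told to skip this device."""
        fs.exclude_from_allocation(device)   # take it out of the allocation loop
        for block_ptr in fs.live_blocks():   # walk the whole index
            if (block_ptr >> 56) == device:  # top byte is the device number
                fs.relocate(block_ptr)       # copy the block elsewhere, update index
        fs.forget_device(device)             # its index number is now free for reuse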


6 Comments

  • By alaric, Sun 5th Sep 2004 @ 7:02 pm

    A few improvements have come to mind since writing the above and mulling it over at lunch today.

    1. It shouldn't be too hard to come up with a scheme where the leaf links in the bottom level of the B-Tree point to extents of data blocks; eg, rather than saying "Logical block 14 of file 7 is over there", it would say "Logical blocks 14-500 of file 7 start over there". This way, large contiguous regions written to files (which are quite common!) can be written contiguously at the allocation pointer and the index updated to suit. If a block in the middle of an existing extent is modified, then the index entry for the extent will need to be split into two extents, skipping the changed block - and in between them, a single-block pointer to the new version of the block (see the sketch after this list). This would reduce the size of the index considerably if most files are written contiguously.
    2. For some noncritical data, all the effort of copying blocks on update to avoid corrupting them if a disk failure occurs during writing, and to ensure consistent snapshots for reads during the update, is unnecessary. For example, caches. As such, it would seem logical to provide an interface for the application to overwrite a block of a file 'unsafely', causing an in-place update of the data blocks in question, but one that might leave the file corrupted if power fails during the write. Luckily, this would be really easy to do; just walk the index to find the current data block, and overwrite it. Ok, it means we have to seek around to do the write, but it can be elevator-scheduled like a read.
    3. Note that if we have uncommitted transactions that have been forced to write some of their dirty blocks out to disk to free up buffer space, then the cleaner must know that those dirty blocks are in use. As such, the cleaner should scan the transactions' dirty buffer pools as well as doing its tree traversal when finding live blocks to relocate.
    4. It occurred to me that it would be really easy to support snapshots of the filesystem state, for short-term "oops I overwrote my stuff" rescues, by storing (say) a list of references to past index roots in the superblock. Or having a linked list of snapshot blocks coming from the superblock, each containing a snapshot name and date, and the pointer to the index root for that snapshot. This then means that the cleaner has to take them into account when doing its tree walk to find live blocks, obviously.
    5. The snapshots of the filesystem would conventionally be read only, but there's actually nothing stopping you making them writable. After all, any updated blocks would just copy-on-write, either leaving the original snapshot unmolested and creating a new root, or updating the snapshot's root pointer. With the former option, you could branch a filesystem, CVS style. Probably potentially useful, somehow...
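
    To illustrate the extent splitting in point 1, here's a small Python sketch. The extent representation and all of the names are just made up for the example; the original on-disk data is simply left behind for the cleaner.

        # Splitting an extent when one block in the middle of it is rewritten.
        # An extent here is (file, first logical block, length, pointer to the
        # first data block on disk).
        from collections import namedtuple

        Extent = namedtuple("Extent", "file first length ptr")

        def split_extent(extent, logical_block, new_ptr):
            """Return the index entries that replace `extent` after
            `logical_block` has been rewritten at `new_ptr`."""
            offset = logical_block - extent.first
            assert 0 <= offset < extent.length
            pieces = []
            if offset > 0:                      # unchanged blocks before the write
                pieces.append(Extent(extent.file, extent.first, offset, extent.ptr))
            pieces.append(Extent(extent.file, logical_block, 1, new_ptr))
            if offset + 1 < extent.length:      # unchanged blocks after the write
                pieces.append(Extent(extent.file, logical_block + 1,
                                     extent.length - offset - 1,
                                     extent.ptr + offset + 1))
            return pieces
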
  • By alaric, Tue 7th Sep 2004 @ 11:37 am

    Another idea I've had is a better allocation strategy for handling filesystems split over multiple disks.

    Instead of having the allocation and clean pointers extend over a giant continuous space consisting of all the disks concatenated together, with big jumps in it to allow disks to be swapped in and out without messing with the numbering, one could instead give each disk its own alloc and clean pointers. When wishing to write out some data, the filesystem can then use any of a number of techniques to decide which disk(s) to put the data onto. For a start, a write could be sent to whichever disk has the most free space, or it could be split evenly across all disks currently accepting new writes (eg, not ones currently being cleaned out because they're about to be removed) to make more efficient use of disk bandwidth. Or all index blocks could go to a special disk dedicated to them, so the cleaner's tree walk only uses up read bandwidth on one disk. Or a write could go to whichever disk looks like it will be fastest to write to, based on the estimated current head position compared to the disk's allocation pointer, and the disk's physical write speed.
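
    The simplest of those policies is easy to state; here's a tiny Python sketch, with the disk objects and their fields being hypothetical:

        def pick_disk(disks):
            """Choose a target disk for the next write run: the one with
            the most free space among disks still accepting new writes
            (i.e. not being drained prior to removal)."""
            candidates = [d for d in disks if d.accepting_writes]
            return max(candidates, key=lambda d: d.free_blocks)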

    The cleaner would be more interesting. When running, it could clear up space on all connected disks in a single walk of the index; each disk would, as well as having a clean pointer, also have a target clean pointer. Any live block found lying between its own disk's clean pointer and that disk's target pointer would then be relocated.

  • By alaric, Mon 20th Sep 2004 @ 12:07 am

    Looks like Sun Microsystems read my blog:

    http://www.sun.com/2004-0914/feature/

    🙂

    What does ZFS have that I've not thought of? Their striping technique is probably similar in effect to my revised multiple-disk allocation system. I was anticipating leaving mirroring to an underlying RAID layer; I don't think there's any benefit to putting it in the volume manager itself - except perhaps for slightly easier system administration.

    128 bit block numbers, eh? I was wondering about using variable-length integers, so you can have huge block numbers when dealing with huge disk pools, but only need 40 bits for filesystems under 512GB.
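
    (One common way of doing variable-length integers is 7 bits per byte with a continuation flag; I haven't committed to any particular encoding, so this Python sketch is purely an illustration of the space saving.)

        def encode_varint(n):
            """7 bits per byte, high bit set on all but the last byte."""
            out = bytearray()
            while True:
                byte = n & 0x7F
                n >>= 7
                if n:
                    out.append(byte | 0x80)   # more bytes follow
                else:
                    out.append(byte)
                    return bytes(out)

        def decode_varint(data):
            n, shift = 0, 0
            for byte in data:
                n |= (byte & 0x7F) << shift
                if not byte & 0x80:
                    return n
                shift += 7

        # A 40-bit block number costs 6 bytes here rather than a fixed
        # 16 bytes for a 128-bit field.
        assert len(encode_varint((1 << 40) - 1)) == 6
        assert decode_varint(encode_varint(12345)) == 12345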

    Compression? I'd considered that (since writes are done in contiguous segments, one could compress each segment then use a "block number of start of compressed segment, index into compressed segment" address scheme), but decided it wasn't worthwhile; it makes the system more complex.

    I think their "variable sized blocks" feature is equivalent to my notion of tracking extents of data blocks in the index, rather than just individual blocks.

    But it looks annoyingly like ZFS is pretty damned similar to my idea - grrr! That's very demoralising. Why do I bother having neat ideas barely days before large companies announce the same idea, so everyone thinks I'm just a copycat? sigh 🙁

  • By John Everitt, Tue 21st Sep 2004 @ 11:10 am

    But it looks annoyingly like ZFS is pretty damned similar to my idea - grrr! That's very demoralising. Why do I bother having neat ideas barely days before large companies announce the same idea, so everyone thinks I'm just a copycat? *sigh* 🙁

    Or worse :(. Infringing Software Patents...

  • By alaric, Tue 21st Sep 2004 @ 9:32 pm

    Indeed! Luckily, my FS idea is really just a combination of existing ideas in databases, file systems, and garbage collecting memory managers... Sun say they have some patent-pending aspects of ZFS, so I'm pretty confident that if any of that overlaps my FS I can point at my sources and say "PRIOR ART!"...

  • By alaric, Thu 23rd Sep 2004 @ 11:15 pm

    Ok, so I've had a bit of a sidetrack thinking how this could be used as a filesystem with POSIX semantics, but of course, for ARGON, it needs to be something different.

    What would be a useful abstraction of the filesystem to expose to the TUNGSTEN storage class layer is that each entity (identified by a serial number) is composed of objects (identified by serial numbers), and that each object is managed by a particular storage class. Each object contains a key:value map, with both keys and values being arbitrary byte strings.

    The storage class can request an entry from within the object by specifying the key, write to the object by specifying a key and a value, delete an entry given a key, or iterate through the keys. All of this happens within the context of a potentially nested transaction. To create a new object, just ask WOLFRAM for a new cluster-wide unique serial number, and write some mappings using that serial number as the object ID. To delete an object, just empty it out.

    This is, of course, implemented on disk as a B-Tree. Each key combines the entity ID, the object ID, and either an inline mapping key (if it's small) or the precomputed hash of the real mapping key plus a pointer to the key elsewhere on disk; the corresponding value is either an inline mapping value (if it's small) or a pointer to the mapping value elsewhere on disk. The precomputed hash is used to speed up equality tests.
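
    A minimal Python sketch of how such a combined key might be encoded; the inline-size threshold, the hash function and the exact field layout are all assumptions for the sake of the example.

        import hashlib
        import struct

        INLINE_KEY_LIMIT = 32    # hypothetical threshold for inlining keys

        def make_btree_key(entity_id, object_id, mapping_key, spill_pointer=None):
            """Build the combined B-Tree key for one mapping. `spill_pointer`
            is the on-disk location of the full mapping key, needed only when
            the key is too big to inline."""
            head = struct.pack(">QQ", entity_id, object_id)
            if len(mapping_key) <= INLINE_KEY_LIMIT:
                return head + b"\x00" + mapping_key
            # Big key: a fixed-size digest (for fast equality tests) plus a
            # pointer to the full key stored elsewhere on disk.
            digest = hashlib.sha1(mapping_key).digest()
            return head + b"\x01" + digest + struct.pack(">Q", spill_pointer)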

    The system writes to disk sequentially, with an allocation pointer and a cleaning pointer onto each disk in the system and a cleaner process that advances the cleaning pointers on request, as mentioned above.

    The system keeps, for each transaction, a hash map of modified mappings, binding the entity ID / object ID / key to a value. The key and the value may be arbitrary byte vectors, but either can be flushed to disk to free up RAM for other functions; if the key is flushed, then it is represented in the hash map by a hash of its contents (so the hash of the EID/OID/key triplet can still be computed), and a pointer to the on-disk copy. If the value is flushed, then the pointer to its on-disk copy is stored.

    If so many mappings are modified in one transaction that the hash map gets too big, then pages of it can be sent to a swap area.

    Therefore, there are three kinds of disk write:

    1. Where we flush out some uncommitted data to free up RAM. In that case, we write a series of keys or values, in a single write. The lower level disk space manager must reveal the position of the write pointer before the write commences, so the system can work out the correct pointers to the starts of all the byte vectors written out in the run. For reasons to be explained later, it is not sufficient for the system to tell us (after the write) what went where; it must be possible to compute the pointers BEFORE the write.
    2. When we commit a transaction. In this case, any uncommitted keys or values are written out, followed immediately by the changed B-Tree index blocks. To do all of this in a single write, we need to know the new pointers to store in the new B-Tree index nodes; thus the requirement to know the allocation pointer BEFORE the write, so that all the new locations can be computed in one go and stored in the index nodes (see the sketch after this list). At the end of this write, the system then needs to write a new superblock to all the superblock mirror locations in the system, containing the new allocation pointer, and the new B-Tree index root pointer.
    3. When we relocate some data due to cleaning. This is very similar to the commit case; we need to write out the relocated keys and values and index nodes, then all the B-Tree index nodes that need updating to reflect the relocation. At the end of the write, the system then needs to write a new superblock to all the mirrors, containing the new allocation pointer, the new cleaning pointer, and the new B-Tree index root pointer.
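
    To make the "know the allocation pointer BEFORE the write" point concrete, here's a small Python sketch of laying out one write run: given the current allocation pointer (a byte offset) and the byte strings to be written, it returns their future addresses before anything touches the disk. The block size, and padding the run to a block boundary (as described in the next paragraphs), are assumptions of the example.

        BLOCK_SIZE = 4096    # assumed physical block size

        def lay_out_run(alloc_pointer, pieces):
            """Compute the on-disk byte offset of each piece in a single
            sequential write starting at `alloc_pointer`, pad the run to a
            block boundary, and return (addresses, run, new_alloc_pointer)."""
            addresses = []
            offset = alloc_pointer
            for piece in pieces:
                addresses.append(offset)     # where this key/value/node will land
                offset += len(piece)
            run = b"".join(pieces)
            pad = (-len(run)) % BLOCK_SIZE   # keep the allocation pointer aligned
            run += b"\x00" * pad
            return addresses, run, alloc_pointer + len(run)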

    However, we need not allocate data in units of disk blocks. Although the disk can only atomically write in units of blocks, so we need to align writes to block boundaries, we can use byte offsets as actual pointers to on-disk data. The allocation and cleaning pointers will always be aligned to physical disk block boundaries. When a write is done, the system will pad the end of the data being written to a block boundary, to preserve this alignment.

    This has interesting consequences. For a start, we are not constrained to fixed size B-Tree nodes; we should still have a size limit on them, for ease of management and bounded processing times, but we need not worry about wasting space at the ends of disk blocks on every node. Also, when we write a series of index nodes, mapping keys, and mapping values, we concatenate them (perhaps aligned to machine words rather than packed at the byte level) rather than rounding each up to individual blocks. This saves space!

    Providing the storage classes with an interface to the object as a dictionary mapping arbitrary keys to arbitrary values makes life a lot simpler for them than if they had POSIX-style files.

    For an array storage class, it can use the array indices as keys and store the elements in the corresponding values. For small elements, it might be worth storing a page of elements in each mapping; divide the index by some suitable constant to find the page number, which is used as the key, with a series of elements in the value. This instantly gets you a sparse array that can provide access and update in trivial time (logarithmic in the size of the entire system, but with significant savings due to index nodes being cached in RAM).
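
    A sketch of reading one element from such a paged array, assuming the object exposes a get(key) operation as described earlier; the page size, key format and element width are arbitrary choices for the example.

        PAGE_SIZE = 64    # elements per mapping (arbitrary)

        def array_read(obj, index, element_width):
            """Read element `index` from an object used as a paged array."""
            page = obj.get(str(index // PAGE_SIZE).encode())
            if page is None:
                return None                  # sparse: this page was never written
            start = (index % PAGE_SIZE) * element_width
            return page[start:start + element_width]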

    For a basic "single cell" storage class, it can just use a single mapping from a constant key (zero length byte array), with the contents of the cell as the value.

    Because our mapping keys are variable-length, we can trivially implement multiple independent "virtual dictionaries" within the object by using the ID of the virtual dictionary concatenated with the virtual key as the actual key, and the virtual mapping value stored directly into the actual value. This can be used to implement a storage class like an SQL table, with the rows themselves in one virtual dictionary, keyed on a row serial number, and then other virtual dictionaries containing indexes (mapping directly from keys to lists of row serial numbers with that key in the indexed field).
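
    The key-prefixing trick is simple enough to sketch in a few lines of Python; the fixed-width dictionary ID used to keep the concatenation unambiguous is my own assumption here.

        import struct

        def virtual_key(dictionary_id, key):
            """Prefix the virtual key with the virtual dictionary's ID."""
            return struct.pack(">I", dictionary_id) + key

        def virtual_put(obj, dictionary_id, key, value):
            obj.put(virtual_key(dictionary_id, key), value)

        def virtual_get(obj, dictionary_id, key):
            return obj.get(virtual_key(dictionary_id, key))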

    A dictionary is such a natural programming abstraction for storage... it makes the storage classes so much simpler to implement compared to POSIX-style stream files, and we can very efficiently provide that abstraction by using a B-Tree!

    This should make it easy to provide a wide range of storage classes, suited to different applications, with relatively little effort. The SQL-style table mentioned above would clearly take relatively little code, for example.

    I can't wait to implement it! 🙂

