Category: Scheme

What’s neat about elegant languages

Most people who like to use "unusual" non-mainstream languages need a motivation for doing so. After all, there are reasons not to: a smaller user community (although that can be a good thing in itself) means fewer libraries, less support for unusual platforms, and less likelihood of your programming in that language ever earning you money.

When asked, most will say that they find programming in their chosen language easier, but it can be hard to explain why.

However, in a discussion on IRC recently, I think I may have captured part of it:

  • alaricsp: I do a lot of programming in various languages, and I tend to find that the amount of coolness I can do per line in Scheme is higher, and I get less bugs
  • alaricsp: As in, I decide to do something complex, sit down and write a hundred lines of scheme, try to run it and get lots of syntax errors due to typos, fix those, then it's semantically bug free about 75% of the time; and in the remaining 25% there's usually just one simple bug (last one was due to me getting confused with some boolean algebra over lists, doing an any? instead of an every? or something like that)
  • alaricsp: I start in implementation space and build up in thin layers to get to problem space
  • alaricsp: Cheap easy-to-use abstraction means it's cost-effective to have lots of thin layers
  • alaricsp: And thin layers are easier to think about, so less buggy

To which somebody else followed up:

  • sjamaan: Aye
  • sjamaan: I find the barrier to creating an abstraction in OO languages to be very high, for example
  • sjamaan: I actually sigh everytime I have to create a new class file!
  • sjamaan: Whereas I don't even think about creating a lambda
  • sjamaan: I just do it

C++

I went the usual route for programmers of my generation: I started off in BASIC on an eight-bit home micro, then got a PC and messed around with BASIC there before moving up to Pascal, then to C and C++, with meddling in assembly in parallel; assembly was never my main language, just something you had to learn for special things like inner loops and messing with low-level things.

I was interested in programming languages, so I read a lot of books on them, but BASIC, Pascal, assembly and C++ were the only ones I had access to implementations of.

So I was messing with C++ in the early 1990s, with an implementation (Borland Turbo C++ (for DOS!), version 2 IIRC) that had things like classes and iostream, but no templates (so no STL) or RTTI. Those were things I fiddled with fleetingly as I experimented with DJGPP, a port of GCC to DOS; but I was soon using my new-found Internet access to get hold of other programming language implementations to play with, so I never progressed beyond the basics.

But, occasionally, I heard of people doing interesting things with templates in particular. Templates are usually explained at first as a way of implementing type parameters to classes, like Java 1.5's generics, but they can do a lot more than that; they are closer in spirit to Haskell's type classes. Templates are literally code templates with parameters, which the compiler fills in with values at compile time (the values can be ordinary values such as integers, or types, or other templates) to generate syntactic constructs such as classes and functions. Normal usage is to define a collection class template in terms of a type parameter, so the compiler can then generate a collection of ints, a collection of pointers to a given struct type, or even a collection of values of an arbitrarily complex class. Templates can be overloaded more or less like C++ functions, so you can have multiple templates with the same 'name' (and signature), some of which are partially specialised; this means that template instantiation can contain conditionals, in effect.

Which is where it gets complex. Templates can recurse, generating entire class hierarchies. As well as the usual case of a class having a member whose type is given as a template parameter, a class template can use a template parameter as a class to inherit from; by recursing on that parameter, hierarchies can be built on the fly. Inline function templates that recurse can create arbitrarily complex bits of code. In many ways this rivals the power of Lisp's macros, except that templates were blatantly never designed to be that powerful, so you have to use crazy, exotic workarounds built by people who have studied the C++ spec with a fine-toothed comb in order to do all sorts of things.

Oh dear.

Let's take a closer look at what went wrong.

Those of us who are familiar with one or more programming languages may sometimes find it hard to see things objectively. There's too much accumulated habit in programming, which sometimes prevents us from seeing the wood for the trees. Also, some people who are not familiar with programming may be interested to know what on Earth I'm talking about here, so I'm going to kill two birds with one stone and shift into the soft and floppy world of metaphor.

Let's imagine that programming is like building houses and bridges and things like that. Programming languages, in this metaphor, are construction techniques. Assembly language is like building structures with an atomic force microscope, positioning atoms by hand to build the metal crystals, stones, and mortar that make up reinforced concrete structures. Although this technique will let you build the strongest possible structures - simply arrange carbon atoms in a tetrahedral structure, with evacuated cavities to decrease weight and crystal dislocations to harden the structure through stressing - it will take you an age to build anything large. And when you try to split the construction process up, you have to be very careful that the faces where subcomponents meet match exactly; they need to be precise to an atomic level, since there's no equivalent of slopping mortar into a gap that will automatically ooze to match the surfaces on both sides.

Also, building a structure that way is intimately tied to the local chemistry. A bridge made out of solid diamond, as described above, is useless in an environment with a high-oxygen atmosphere and a normal temperature range of a few thousand kelvin, as diamond burns quite nicely. A design for a diamond bridge expressed at the atomic level can't be easily translated into a bridge of the same shape in steel. Even things like the crystal dislocations to harden it and little vacuum bubbles to decrease the weight don't apply in the same way to steel. The metaphor here being that assembly language applies only to a given CPU type.

BASIC, on the other hand, is like Lego. You can very quickly build small structures, even with little skill. They'll look a little lumpy, but it gets small jobs done quickly and is easy to learn. Indeed, it's great as a training tool for potential future engineers, although a lot will have to be unlearnt in order to move from Lego to poured-concrete construction. Also, the same design can be realised in plastic bricks, diamond bricks, or frozen methane-ice bricks, meaning the same design can be applied in lots of different chemistries.

Mainstream high-level languages like Pascal, C, C++ and Java are like assembling things from premade blocks - from bricks up to giant prefabricated beams and lintels. They even let you come up with your own prefabricated components by specifying how they should be built from basic components, although they only come with small basic units, barely higher than those offered by assembly (much smaller than the bricks of BASIC); but at least those basic units are mainly independent of chemistry. Languages with module systems make a surprising difference for building large practical structures in a commercial environment - modules are like catalogues of prefabricated components, which make it easier for them to be shared and reused between projects.

Declarative languages are like building a former out of wood and pouring concrete into it. Rather than explicitly positioning structural members, we just specify the overall shape we want, and let an automatic process (the flowing of liquid concrete under gravity) fill in the details for us. It's quick, but it doesn't give us fine control over the result we get. No graceful suspension bridges.

Dynamic languages are a bit like our BASIC Lego bricks (plus the ability to order prefabricated components, like higher-level languages), except that we only get one kind of fundamental brick. The good thing is that this one brick can support load AND conduct electricity AND transport water AND carry sewage away. This means you can build some very simple and compact buildings, by having the walls transport electricity and water to where they're needed and take waste water away; but you have to be careful to make sure the bricks don't get confused and do the wrong thing (such as feeding electricity into the water supply, or spewing sewage out of the wall).

And then we get to languages with metaprogramming. Very basic metaprogramming - perhaps at the level of C macros - is a bit like being able to ask for prefabricated components made to custom dimensions. Rather than a catalogue listing "50cmx50cmx10m pre-stressed concrete beam", we can have an "XcmxXcmxYm pre-stressed concrete beam", and fill in our own X and Y when we order it.

Whereas C++ templates and Lisp macros are like being able to set up companies that build entire arbitrarily complex building modules to spec. Writing a metaprogramming abstraction is like setting up a company that, given the width and depth of a river and the size of a road, will return you a standard bridge to take that road across that river. The downside is that it'll be their standard bridge that looks about the same as all their other bridges; but the upside is that if you don't like it, you can still build your own bridge out of basic components, or design a bridge template of your own that you reuse. Indeed, you could design a bridge template for multi-lane roads that works by building lots of a single-lane bridge template, side by side. Or a template for long bridges that works by building any number of a simple arch bridge (which can only cross a given maximum distance) between pontoons sunk into the river bed.

But the problem is that C++ is an extension to C. C is a very low-level language, barely above assembly as these things go; every construct in C has an obvious and simple representation in assembly language. C++ is an attempt to add high-level loveliness such as metaprogramming and catalogues of large components on top of that.

And, as such, it's hampered by its low-level past. C++ programmers have to worry about low-level details that higher-level languages completely handle for you, such as storage management; and C++'s limited runtime type information means that a lot of information has to be made statically known at compile time.

Going back to the building metaphor: the fact that C++ requires the documentation for an API to be clear about whether passing in a pointer counts as passing ownership (with the obligation to delete), and under what circumstances that object may be deleted (which has a bearing on what else the caller can do with that pointer after it's been passed in), is a bit like having a high-level catalogue of building parts, but requiring them all to come with chemical formulae and accurate engineering drawings for their joining surfaces. Users of such prebuilt parts must examine every place where the parts will touch other parts, checking for chemical compatibility (bolting steel to bronze parts won't do, as they'll electrolytically corrode each other) and making sure the faces will mate correctly; if they won't, the designer needs to allow for some kind of mortar to go between them, which in itself has to be chemically compatible with both surfaces.

Higher-level languages, by contrast, are a bit like having international standards for load-bearing surface connections (standard sizes of bolts, standard surface coatings that are chemically compatible), electrical connections, and so on. It makes it all a whole lot easier. The cost is a loss of fine control; in a very few circumstances you might need to really control how two parts are connected, perhaps in making a bridge that, in an earthquake, will fall apart in a very controlled manner. But in C++ you need to precisely specify how everything mates even when you don't really care, which slows down the design process, and makes it easier for a human error in the construction process (the wrong kind of mortar used in a joint - they all look the same!) to cause a problem that only becomes apparent years after the structure is completed (when the slow corrosion of a beam by the acidic mortar causes it to collapse). C++ is rife with subtle cases that produce "undefined behaviour": calling delete on a polymorphic class without a virtual destructor; passing an instance of a class to a function with ellipsis arguments. All of these things are easy to do, cause no compile-time errors, and may well work fine at run time most of the time. And when they do fail, they are almost impossible to trace back to their cause.

I'm not saying that people should never be given the power to specify things at that level - but that it should be done by letting users go down to that level of detail explicitly. For example, being able to specify a component at the atomic level, but then packaging it along with information about its surface properties, so the high-level building design system can automatically and correctly integrate with it. Languages like Chicken Scheme let you embed C code, as long as you tell Chicken the types of all values passed in and out so it can perform automatic wrapping and unwrapping.
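
As a minimal sketch of what that looks like in Chicken (this uses Chicken 5's module names, and it only works in code compiled with csc, not in the interpreter - details vary by Chicken version):

(import (chicken foreign))

(foreign-declare "#include <stdlib.h>")

;; Wrap the C library function abs(); the type annotations tell Chicken
;; to convert the Scheme integer to a C int on the way in, and back out.
(define c-abs
  (foreign-lambda int "abs" int))

(print (c-abs -42)) ; => 42

The type annotations are exactly the "surface properties" of the metaphor: they let the high-level system integrate the low-level component automatically.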

C++'s templates get around some of these problems; it's possible to make templates that automatically adapt their interfaces depending on what they're interfacing to. This means that users of the templates can just let the magic happen and not worry much about it; but it means that template developers must understand the issues and anticipate them in advance, to make sure their templates will work correctly for the user.

Also, templates have been stretched beyond their original design, which is like using a system for automatically choosing the right material to use as mortar to fill a gap, with support for including layers of other materials such as damp-proof courses, to build entire pillars by telling them to fill a very large gap with repeating layers of concrete. It gets the job done, but it's working around the system rather than working with it. In the resulting design, a pillar is labelled as a "gap that needs filling with something" rather than as a pillar.

There's great ingenuity there. I'm in awe of the job Bjarne Stroustrup and the C++ standards committee have done in building such powerful facilities on top of such a meagre language as C. I think it's misguided, but brilliant. And I'm in awe of people like Andrei Alexandrescu, who have figured out how to make C++ do useful things it was never designed to do, through cunning and devious tricks.

The same kind of cunning is shown by the engineers who look at things like quantum mechanics and use them to invent the MOS transistor, and from there figure out how to mass-produce vastly complex integrated logic circuits for pennies. It's amazing to take what resources physics throws at you and manage to build things like computers out of it, just as it's amazing to take what the C++ language specification throws at you and manage to produce a template that works out whether one type is convertible to another by - get this - declaring two overloads of the same function name, one taking ellipsis arguments and the other taking the target type, with return types of different sizes; then declaring a function that returns a value of the source type, and examining sizeof(func1(func2())) to see whether it equals the size of the return type of the fallback ellipsis overload (which matches anything) or that of the more specific overload, to see which one matched.

It gets the job done, but it takes a wizard to figure out how to do it. Sure, the wizard can do it and wrap it up in a nice little reusable package that anyone can use, but it shouldn't have to be this way. Semiconductor engineers have to do complex things to get faster chips because they have no choice in their substrate. But programmers do have a choice - they can choose a better language.

I feel that it is the obligation of language designers to make their language such that useful things can get done without hacks only wizards can come up with. Anybody should be able to do useful things.

I'm renowned for liking Lisp. Most of the clever tricks done with templates are trivial as Lisp macros, or even as plain old Lisp source without needing macros, or are just completely irrelevant in Lisp (smart pointers? Hahaha!). Many things are trivial in Lisp that aren't worth doing in C++, because C++ only allows programs that can be type-checked at compile time, and only a subset of correct programs can be statically proven correct by any given automatic checker. But it's not all rosy; templates arrange to do at compile time a lot of the stuff Lisp does at run time. The reason they're cranky and complex is that doing this stuff at compile time is a lot harder than doing it at run time; but getting it done at compile time means type errors are caught during compilation rather than lurking until run time, and the compiler can generate highly optimised code.
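
To make the contrast concrete, here's a minimal sketch (my example, not from the post) of how cheap a metaprogramming abstraction is in Scheme: a hygienic swap! macro, defined and used in a handful of lines.

;; A hygienic macro that exchanges the values of two variables;
;; the temporary binding cannot capture or clash with user names.
(define-syntax swap!
  (syntax-rules ()
    ((_ a b)
     (let ((tmp a))
       (set! a b)
       (set! b tmp)))))

(define x 1)
(define y 2)
(swap! x y) ; now x = 2, y = 1

There's no separate template language and no specialisation tricks; the macro is written in roughly the same notation as the code it generates.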

The Lisp community has done some work on adding optional static typing; Common Lisp allows type declarations, and compilers may use them to generate code on a par with a C compiler, but I've never seen a typing system as rich as C++'s with templates and so on in a Lisp.

It'd be an interesting experiment - combining the power of templates to statically type-check complex stuff with the power of not HAVING to statically type-check everything. How would it impact the design of the standard libraries? Primitives like car (which returns the head element of a linked list) would need to have complex types to support both cases: (car <List>) :: <Anything> for the general case, but (car <List(X)>) :: [<X>|Error] for the typed case, since attempting to call car on an empty list of Xs is an error - unless we keep exceptions out of the type system.

Ugarit: initial beta

I'm pleased to announce the first beta release of Ugarit, a backup/archival system based around content-addressed storage, written in Chicken Scheme.

This initial release supports archives stored in the filesystem, including on remote servers via NFS and similar protocols. Future versions will add support for storing archives in S3 or on remote hosts via SFTP/SSH, and a pluggable storage backend system allows many other forms of archive to be created.

Ugarit provides efficient snapshots and restores without requiring any intelligence from its storage; anything that works roughly like a filesystem can be used as a Ugarit backend. It is designed to minimise the amount of data sent to the archive, in order to reduce both snapshot time and transfer and storage costs on services like S3.

I've tested it on various test filesystems, ranging from a contrived example containing all sorts of funny things like FIFOs and devices, up to 500MB of /usr/pkgsrc and >2GB of /usr. I'm going to see if I can borrow some big hardware at work to test it on some many-hundreds-of-gigabytes filesystems as well, to see if I can find any scaling issues, and I'm currently putting it into place as my personal backup system. However, this is still beta software, so please be careful and test your backups!

For details and installation instructions, see the Ugarit project page.

Future developments planned include:

  • File modification time caching, reducing the time taken to identify changed files to snapshot.
  • Encrypted archives.
  • Replicated archives, supporting both fault-tolerance over multiple archives and local caching, where extractions are serviced from a local archive, but if the local archive is lost (even just partially), a remote archive can provide the missing data.
  • More storage backends.
  • FUSE support, so you can browse your archive as a read-only filesystem.

Ugarit interactive restore

Ugarit is coming along nicely. I've written the interactive archive exploration/extraction shell, although it's still a bit ugly: mtimes are displayed as raw numbers rather than in a human-readable format; the fields in ls -l output aren't padded to fixed widths; you can only cd up or down one level at a time rather than using a path; and little things like that.

Here it is in action, starting from the top of an archive with a single tag called Test that has two snapshots at different times. current just refers to the most recent snapshot of the two. I extract LICENCE.txt then take a look to see how it came out.

> ls
Test <tag>
> cd Test
/Test> ls
time<1232405984.074> <snapshot>
time<1232405984.162> <snapshot>
current <snapshot>
/Test> cd current
/Test/current> ls -l
-rw-r--r-- 1000 100 time<1231987453.0> README.txt
lrwxr-xr-x 1000 100 time<1231988569.0> LICENCE.txt -> subdir/LICENCE.txt
drwxr-xr-x 1000 100 time<1232013672.0> subdir
drwxr-xr-x 1000 100 time<1232155290.0> .svn
prw-r--r-- 1000 100 time<1232052740.0> FIFO
crw-r--r-- 0 100 time<1232154570.0> chardev
brw-r--r-- 0 100 time<1232154578.0> blockdev
/Test/current> cd subdir
/Test/current/subdir> ls -l
-rw-r--r-- 1000 100 time<1231987453.0> LICENCE.txt
drwxr-xr-x 1000 100 time<1232155290.0> .svn
/Test/current/subdir> get LICENCE.txt
Extracted LICENCE.txt
/Test/current/subdir> bye
-bash-3.2$ cat LICENCE.txt 
Copyright (c) 2008-2009, Warhead.org.uk Ltd

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

Neither the names of Warhead.org.uk Ltd, Snell Systems, nor Kitten Technologies, nor the names of their contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

It's getting there! The only thing that's really holding me back now is that I have limited Internet access in the evening to read the manuals for the Chicken eggs I want to use.

Mainly, I need a command-line argument wrapper, and an encryption engine.

I want to offer the choice of no compression or deflate compression when writing into the archive; but each block should be marked with a prefix byte stating its compression algorithm, so that a block can be decompressed no matter how it was compressed, as long as your copy of Ugarit knows the algorithm. I'll add lzma, and make it the default, as soon as I've written a wrapper for liblzma.
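
A hypothetical sketch of that prefix-byte scheme (the names and representation here are mine, not Ugarit's actual on-disk format):

;; One byte at the front of each block names its compression algorithm,
;; so old blocks stay readable as new algorithms are added.
(define algo/raw 0)
(define algo/deflate 1)

(define (wrap-block algo data)   ; data: a string of (possibly compressed) bytes
  (string-append (string (integer->char algo)) data))

(define (unwrap-block block)     ; returns (algo . data)
  (cons (char->integer (string-ref block 0))
        (substring block 1 (string-length block))))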

Then I want a choice of encryption algorithms, which I plan to provide by writing a wrapper for libmcrypt rather than using the cryptlib interface for Chicken I've already found; cryptlib doesn't seem to be in pkgsrc on NetBSD, and libmcrypt looks nice and simple.

There's a standard Scheme library for command-line parsing, called args-fold (SRFI 37 - yes, another form of fold...), which does a similar job to getopt libraries, but I don't have it installed yet. And I need to check out a Scheme library I saw for indentation-delimited syntax, which might make for a configuration file format more appealing to parenthophobes...

fold>cons

Or should I say, (> fold cons)

In a way, I'm sad that I started my life programming imperatively rather than functionally. While I feel that imperative programming is still the most appropriate for some problems (particularly ones that really do look like state machines), and logical programming is great for dealing with big blobs of knowledge (such as databases), most programming tasks, I think, are best handled functionally.

But imperative programming is in fashion; it's what people are taught because it's what people use, and it's what people use because it's what they were taught, and lots of feedback loops maintain that status quo: there are more tools for imperative programming, and more books, and so on. Lots of people think functional programming is somehow mind-bending and difficult, but that's just because they don't know it well, and the first taste of something unusual often seems bizarre and worrying.

So, although I find functional programming more efficient and pleasant than imperative, I don't get to do very much of it. And, therefore, I'm really still learning it.

While working on Ugarit, in particular, I learnt a new trick. And that trick was the true usefulness of fold.

fold is a function in Scheme and similar languages that applies a supplied function to every element of a list; but rather than building a new list of the results, as map does, it passes the result of each call into the next call. To get the process started, you supply an initial value to pass into the first call.

Here's what it'd look like in C, assuming you were dealing with an array of ints:

#include <stddef.h> /* for size_t */

/* Apply kons to each element of int_array in turn, threading an
   accumulator through the calls; knil seeds the accumulator. */
int fold(int *int_array, size_t array_len, int (*kons)(int element, int acc), int knil) {
  int acc = knil;
  while (array_len--) {
    acc = kons(*int_array++, acc);
  }
  return acc;
}

The names kons and knil are hints as to what they do; hints that I should have noticed, but didn't. But more on that later.

Now, you might use that to add up the elements of an array, by making a function that adds its two integer arguments and returns the result, and passing it to fold with 0 as knil. Or multiply them all together. Or something.
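
In Scheme, with the fold from SRFI-1, that summing example is a one-liner (a quick illustration of mine, not from the post):

(import srfi-1)        ; Chicken 5 with the srfi-1 egg; other Schemes load SRFI-1 differently

(fold + 0 '(1 2 3 4))  ; => 10, computed as (+ 4 (+ 3 (+ 2 (+ 1 0))))
(fold * 1 '(1 2 3 4))  ; => 24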

Now, I rarely have to add up the elements of lists, so when I read about fold in the Scheme specification, I thought "Meh, I'll remember that when I next need to add up the elements of a list", and mentally discarded it into the pile of barely-useful tools.

My mistake was that I was thinking of fold as a function that applies a function to every element of a list. Which is rather the wrong way round. fold is really a mental model of a sequence; the function fold is just an implementation of that model for lists. It should really be called list-fold.

What made me realise this was reading the interface to Alex Shinn's gdbm egg. It's a wrapper to the gdbm library, which provides file-based persistent key-value mappings.

One of the operations the gdbm library gives you is to iterate over the whole database. There's a gdbm_first function to get the first record, and a gdbm_next function to get the next one. Call gdbm_first, then keep calling gdbm_next until you stop getting records. That's an imperative interface, as it involves having a "cursor" that each call to gdbm_first or gdbm_next moves around; it's all about altering state.

But in the gdbm egg, Alex provided a more functional way of doing that. He provided gdbm-fold, which accepts a gdbm database handle, a kons function to call on each record in turn (along with the return value of the last call to kons), and a knil value to get the process started. It returns the result of the final kons call.

Now, the interesting thing is that if you just want a list of records, you can pass in an empty list for knil, and for kons, a function that takes a record and a list and returns a new list starting with that record and then continuing with the list passed to it (i.e., sticking the new record on the head of that list). Indeed, were it not for the fact that gdbm-fold calls kons with three arguments - the key of the record, the contents of the record, and the result of the previous kons - you could pass in the standard Scheme function cons as kons, and the empty list '() as knil. Which is where the names kons and knil come from: '() is pronounced "nil", and cons is an acceptable kons function.
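
So reading a whole database into an association list might look something like this (a sketch assuming the argument order described above; check the gdbm egg's documentation for the actual signature):

;; Assumes the gdbm egg is loaded and db is an open database handle.
;; kons receives (key value accumulator); knil is the empty list.
(define (gdbm->alist db)
  (gdbm-fold db
             (lambda (key value acc) (cons (cons key value) acc))
             '()))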

Indeed, you can use the traditional fold to copy a list, by passing in cons and '(). It'll come out in reverse order, mind, since fold starts at the beginning of the list: the first cons joins the first element onto '() to get a one-element list, the second cons joins the second element on top of that, giving a list with the second element then the first, and so on, until you get the entire list reversed.
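
That is:

(import srfi-1)

(fold cons '() '(1 2 3))
; => (3 2 1), built up as (cons 3 (cons 2 (cons 1 '())))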

But this is the interesting thing about fold. Folding lists isn't actually all that interesting. What's interesting is that anything list-like can have a fold operation.

See, when working on Ugarit, I came across various sequential structures that live within archives. When a file is chopped into blocks and stored, the entire file is referred to by a reference to a list of references to all the blocks, that lives in the archive. And a directory is a list of directory entries. And a tag refers to a chain of snapshots going back through time.

When I wrote the first of these structures - the list of references to blocks, which I called a key stream - I needed a function to get a key stream back out of the archive, so the blocks of a file could be brought out in order and written back to disk in an extraction operation.

My first instinct, of course, was to write a function that would recurse over the tree structure used to store a key stream within the archive (because if a key stream is too large to fit in a block, it needs to be split into blocks, and then another, smaller, higher-level key stream used to collect that list of blocks...), building up a list in memory which it would return. The problem is, each of the blocks that represent a file holds a megabyte of data; a file that's a terabyte long would involve a million keys, and building up a list in memory to hold them all struck me as fundamentally wrong. But gdbm-fold lets you iterate over the entire contents of a database without loading all of it into memory.

So I wrote fold-key-stream, a function that takes a reference to a key stream, a kons function that is called on every key in turn, and a knil value to get the ball rolling. Having seen gdbm-fold, I had a hunch this was the right approach. I expected to have to perform a mind-bending turning inside out of my simple tree walk, but when I sat down and did it, the right answer fell out easily:

;; kons is called as (kons key type accumulator) for every key in the stream, in order
(define (fold-key-stream archive key ks-type kons knil)
   (let ((type (archive-exists? archive key)))
      (if (eq? ks-type type)
         ; An internal key-stream block: fold over its subkeys,
         ; threading the accumulator through recursive sub-folds
         (let ((subkeys (deserialise-key-stream (archive-get archive key))))
            (fold
               (lambda (subkey acc) (fold-key-stream archive subkey ks-type kons acc))
               knil
               subkeys))
         ; Leaf node: just call kons on it
         (kons key type knil))))

The function asks the archive for the type of a block, given that block's key, to see whether it's followed the tree all the way down to the leaves or not: the key-stream blocks will all have a particular type (ks-type), while the leaves - the actual blocks the key stream points to, such as file data - will have some different type. So you could call fold-key-stream on a block that's not even a key-stream block, and it'll treat it like a key stream with just one element in it: that block.

So the code starts by checking the type of the block. If it's not a key-stream block, it's a leaf block, so we just call kons on it, passing in knil since this is the only call to kons, and return the result. Job done.

But if it's not a leaf node, then we go into the (let ... bit. The first thing we do is read the key-stream block and convert it into a list of keys, which we call subkeys. And then we get funny, by calling that 'useless' fold function for summing up lists on it. Now, we have a list of keys, which may be actual keys to call kons on, or keys of further subtrees. But we already know how to handle both of those cases: if it's a leaf node, fold-key-stream will just call kons on it, and if it's the key of a further subtree, then that subtree is actually a smaller key stream in itself, and fold-key-stream can deal with it... so we just wrap fold-key-stream in a little wrapper function that adds in the extra parameters, such as the reference to the archive, and use fold to call it on every subkey in the block. fold provides the basic plumbing for us, of actually calling the function and passing in the accumulator parameter. The clever part is that each call to fold-key-stream is seeded with the accumulator from the previous call; we're recursing folds within folds to walk down a multi-way tree, but the chain of passing each kons result into the next is preserved.

So really fold is more like a functional foreach operation: the fun isn't in calling it on lists, but in providing fold operations for all sorts of sequential data structures. Since then I've written several fold operations for various data structures within the archive. Indeed, one of them is a high-level ls operation on nodes within the archive, from the root that lists all the tags in the archive, to each tag that has a list of snapshots, to the directories and files within the snapshots. I will use this with a kons function that prints out each entry in turn, pausing for a keypress after every twenty or so, and returning to a stored continuation (sort of like a longjmp in C, or throwing an exception) to break out of the fold if the user presses 'q'.
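
That escape trick looks something like this (a generic sketch of mine using SRFI-1's list fold, not Ugarit's actual ls code): capture a continuation outside the fold, and invoke it from inside kons to abandon the rest of the traversal.

(import srfi-1)

;; Return the first element satisfying pred, without folding over the rest.
(define (find-first pred lst)
  (call-with-current-continuation
    (lambda (return)
      (fold (lambda (x acc)
              (if (pred x) (return x) acc))
            #f
            lst))))

(find-first even? '(1 3 4 5)) ; => 4; the 5 is never examined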

In practice, any function that builds up a list and returns it probably ought to be recast as a folder; this can be done simply by letting the user provide their own cons and '() (let's call them kons and knil...) rather than using cons and '() to build the list up yourself. Then the user can get a normal list if they want, or can directly process the elements in turn, rather than getting a list from your function that they immediately strip down.
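
For instance (a made-up example, not from Ugarit): a routine that reads every line from a port can expose a fold instead of always building a list, and the list-building version falls out as one instantiation.

;; read-line comes from (chicken io) in Chicken 5.
(import (chicken io))

;; Call kons on each line in turn, threading the accumulator; knil seeds it.
(define (fold-lines kons knil port)
  (let loop ((acc knil))
    (let ((line (read-line port)))
      (if (eof-object? line)
          acc
          (loop (kons line acc))))))

;; The old list-building function is now just fold-lines with cons and '().
(define (read-all-lines port)
  (reverse (fold-lines cons '() port)))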

So, a fold function is, in effect, a way of representing a list. Or, at least, a potential list that doesn't exist until you invoke it. Which leads one to think of defining list operations like map over them...
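
Following that thought one step further (my sketch, not the post's): if a "potential list" is a procedure awaiting a kons and a knil, then map over such folders takes only a few lines.

(import srfi-1)

;; A potential list: a procedure that will fold over lst when given kons and knil.
(define (folder-for lst)
  (lambda (kons knil) (fold kons knil lst)))

;; map over any such folder, reversing at the end to restore the original order.
(define (map-over-fold f folder)
  (reverse (folder (lambda (x acc) (cons (f x) acc)) '())))

(map-over-fold (lambda (x) (* x x)) (folder-for '(1 2 3))) ; => (1 4 9)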
