Designing a general data model (by )

All that talking about HYDROGEN was such fun, I was sad it was over.

So what could I explain about ARGON next? My design for CHROME is currently just a list of features and a plan for how to put them together, with no details whatsoever; TUNGSTEN is still in flux because I have research to do in the field of replicated storage, LITHIUM depends on TUNGSTEN, MERCURY depends on CARBON and LITHIUM, and so on.

So, I guess I'll have to talk about IRON, which is a data model.

How everyone else does it

Few people would think that a data model would be important in a software platform project; after all, they're never headline features in Java, .NET, or POSIX. But mainly, this is because they all share roughly the same core data model, for historical purposes - and then a slew of other data models on top, that have evolved to try and plug the holes in the historical one.

Most programming languages have a model of strings, integers, floats, arrays, and some kind of structure type. And there's usually pointers, or a safe reference type. Many have function pointers, too, and some are so lucky as to have closures. Some specify that the runtime environment must collect your garbage for you, others need you to tell it when things aren't needed any more. The higher-level data models that get built on top are things like XML, XDR, ASN.1, the table model of SQL, or the chunked structure of PNG and IFF files. There are varying degrees of separation between each data model, and one or more representations of data in that model as strings of bytes.

HYDROGEN has a data model; memory's a sea of allocation units indexed by numbers from some arbitrary base, and sequences of those units may be interpreted as characters, floats, 'cells', or signed or unsigned integers of specified bit widths. We have subroutine handles rather than function pointers, which may or may not actually be pointers. It's all very low-level, as that's the level HYDROGEN works at, but it's none too pleasant for higher-level programming.

How we do it

So I've defined the IRON data model. It's the data model that CHROME programs manipulate, that the CARBON global knowledge base is based on, and that TUNGSTEN will persist. It's four things, really - a notional data model (like ASN.1), a textual notation for humans to edit (like XML is), a binary format for efficient and compact interchange between computers (like XDR is), and an in-memory representation defined in terms of HYDROGEN, which is not specified; however, a set of HYDROGEN words to manipulate the in-memory representation (including converting to and from the textual and binary forms) is defined.

CHROME, being a homoiconic language (not to be confused with a homo icon, though), actually uses IRON as its syntax; CHROME programmers will write in the IRON textual notation. Similarly, low-level editing of CARBON knowledge bases will be in the IRON syntax. Not that I plan for big strings of IRON source code to be kept around; it'll all live in TUNGSTEN in binary form, and be spat into textual format for human editing (with that human's favourite conventions for indentation; and maybe even that human's favourite textual syntax, if they prefer a non-standard one), and parsed back when they're finished.

Many people will tell you that textual notations as the default for storage are a good idea; but that's only true in a POSIXy world of files that are easy to drag into a text editor (just leaving you to battle with whether it's actually a text file or not, and what character encoding it's in). In the world of ARGON, we'll be storing almost everything in TUNGSTEN, which is a store of IRON data by design; so the low-level editing tool will be an IRON editor...

Details, details

It will be no surprise to those who know me that IRON has a distinctly LISPy feel to it; indeed, we have traditional S-expression lists, written in the traditional way as space-separated lists of items in parentheses, and dotted pair notation for improper lists. Numbers and strings work pretty much as in Scheme. However, unlike in s-expressions, no great emphasis is put on lists, as we have records instead; these, and the way we handle symbols, are the two major differences between IRON and the s-expression data model.

I'll start with symbols. Symbols in s-expressions are just another form of string, written without quotes. The difference between them and normal quoted strings is that, when loaded into memory, they are all stored in a global hash table, and then represented purely as pointers to the strings. So two symbols composed of the same sequence of characters will have but one in-memory representation; they're quick to compare, as the pointers can just be compared. Therefore, symbols are intended for short strings that are primarily compared with each other; Lisp doesn't provide operations to find the nth character of a symbol, or the length of it (although you can ask for the string representation of a symbol, then do those things). They're normally used as identifiers in source code, or flag values in other contexts.

But there's endless problems with name clashes. Some library of useful code might use the symbol foo to identify a function that does one thing, while another uses foo to identify a function that does something entirely different. Code that wishes to use both libraries has to have some way of resolving the ambiguity.

XML has a form of symbol, although it's not really exposed as such - element and attribute names. And XML also ran into the same problem of name conflicts, so they defined the XML namespace system, whereby each symbol is bound into a namespace, identified by a URI. The idea being that everyone who defines a set of symbols as having certain meanings will do so in the context of a namespace identified by a URI they are responsible for, and therefore can uniquely assign to that mapping between names and meanings. Therefore, two different namespaces can both define a foo, and XML files can use namespace declarations to say which foo they mean, and the problems go away.

IRON does something similar with its symbols. Except that CARBON will define a naming system that suits our needs better than URIs, which have various thorny problems, so we define our symbols in terms of those. In fact, that's all an IRON symbol is - a CARBON name. One can write a full CARBON name surrounded by angle brackets as a symbol, such as </argon/iron>; or one can just write a path relative to a current namespace, such as iron or iron/note; or one can write one relative to a declared namespace with a prefix, such as fe:note. If the current namespace is set to </argon> and the declared namespaces are fe -> </argon/iron>, then fe:note, iron/note and </argon/iron/note> all denote the same symbol.

The current namespace can be changed within the scope of an IRON value and its sub-values by writing !defns </argon>, or one can define a declared mapping within the same scope with !ns fe </argon/iron>.

Records are the next interesting thing. Notionally, records have a type, which is a symbol, then zero or more fields, which can be anything. They're rather like compound terms in Prolog.

But the fun is in the syntactic sugar they support.

The most basic syntax is the symbol then the fields, separated by spaces, and surrounded by square brackets. For example, [cr:+ 1 2 3], or [</foo/bar> "Hello"]. As you can imagine, this is good for representing Lispy source code, or indeed anything that is logically a tree with typed nodes; although the observant will notice that forcing the first element to be a symbol is more restrictive than normal Lisp evaluation semantics, which allows any expression there.

But it doesn't stop there. Inspired by the often-elegant yet simple syntax of Smalltalk message sending, if the last part of the record's type symbol contains colons, it can be written in parts, delimited by the colons, distributed amongst the fields.

So [foo: 1 bar: 2 baz: 3] is the same as [foo:bar:baz: 1 2 3], but nicely assigns labels to the three fields.

There's more - many records wrap a larger object in order to annotate it in some way, attaching metadata. By convention, such records should be named with a symbol that ends with two colons, and take the object that the metadata is attached to as its last field. Then one can write [rating: boring]: foo as shorthand for [rating: boring : foo], which is in turn shorthand for [rating:: boring foo]. The [...]: ... notation means that the annotating record doesn't need to have its closing ] at the far end of the annotated object, where it can get rather lonely and hard to trace back if the annotated object is large.

Like in s-expressions, as well as lists, we have vectors, written #<1 2 3>. The notional difference is that lists are explicitly made of two-element cons cells, so can be improper lists, while vectors are restricted to being sequences of elements, with no expectations about their representation other than that they're more likely to have characteristics like a contiguous array in memory (eg, constant-time indexing, but possibly linear-time concatenation). However, for efficiency, vectors support specialisation to homogeneous forms, like Scheme's SRFI-4 vectors; one can declare a vector of u32 values (unsigned 32-bit integers), that cannot contain any value not representable in such a form. Knowing this, the implementation may implement such vectors very efficiently indeed, without needing any dynamic typing information. Indeed, quoted strings in IRON are just shorthand for homogeneous character vectors. "abc" is a nicer way of writing #char<#'a' #'b' #'c'>.

Other than a few fine details about Booleans and funny characters, that's pretty much how IRON's textual representation and basic data model work. I've not sat down and bashed out a precise binary format yet, but it'll be pretty easy; the fun part will be the compression algorithm (I had some interesting ideas about that).

However, the HYDROGEN API for manipulating IRON values, as I intimated, also implements garbage collection for them with close cooperation from the HELIUM memory manager; and, more unusually, enforces that any given value may either be unique, or immutable; and that immutable values may not contain references to unique values. As there can only ever be one reference to a unique value, we can mutate it for efficient update while still maintaining the illusion of purely functional data structures, knowing that our update function accepts the only reference to the object so that nobody else can notice if we happen to return one that's just been mutated rather than created afresh. And unique values do not need garbage collection, as we can deduce statically when the only reference to an object is lost, and generate code to deallocate it there and then. Concurrent Clean is, to date, the best working example of what this style of data model lets one do.

Rationale

I think it's important to push the notion of a data model down low in a system. POSIX and the Internet have taught us to work with strings of bytes as a lowest common demoninator, which is terribly unfriendly to users; they end up being exposed to character encodings and the like. Programmers are continually having to reinvent the wheel with basic details of file formats.

The OSI model of networking, like the Internet, worked in terms of blocks of bytes up to about the layer of TCP; but then, unlike the Internet, they then slapped on the Presentation Layer. The idea was that, once the basics of the network connection were established, the byte-stream model was promptly abandoned, and everything above that (including application protocols, like the Internet's HTTP and SMTP) defined in terms of the ASN.1 data model; at the presentation layer, the ASN.1 values were encoded into blocks of bytes to be passed to the Session Layer beneath.

Whereas in the Internet, each protocol has to have its own conventions for character encodings, and how content types are conveyed. And so implementations of each protocol need to deal with this separately, making them complex, and end up causing usability problems.

So by putting IRON in at the base layer of ARGON, I'm hopefully going to avoid all that... while making my own life simpler, as higher layers will not need to worry themselves with such issues 🙂

Making sure the model is rich enough, with a written syntax that's simple enough, to represent things like source code is also an effort-saver. After all, what use is there in a notation that's not pleasant enough to write code in? XML is often criticized for its verbosity; it's designed for marking up text, where you have lots of text with the occasional <element>, so making them stand out at both ends (with </element>) was more important than making them easy to type; when people started (ab)using XML for data interchange, producing documents that are more tags than content, people quickly complained of RSI. And so after a brief spurt of buzzword-driven development, XML as a data interchange format fell away to things like JSON and YAML; while, as usual, people in the LISP world have been using s-expressions for code and data since the 1950s. If only the world had listened, eh?

But s-expressions can sometimes be awkward, particularly when dealing with larger structures, which is where the records come in. I'd found the Smalltalk syntax was often more pleasant for larger structures, like conditionals: it can be hard to spot the structure of a Lisp if if the three parts are long, requiring the eye to line up indentation over a raft of white space:

...
  (if (or (foo-bar-baz wibble)
          (fnargle (if wibble
                       (foo wibble)
                       (bar wobble))))
      (make-fnargle
        (foo (bar baz (wobble)))
        (bosh
          (bash bam bazzle)
          (wortlethorp wrangle)
          (wham (lambda (fish)
                  (let loop ((food 'cheese)
                             (catfood fish))
                    (if (much waffling)
                      (if (eq? catfood 'potato)
                        (loop 'yoghurt catfood)
                        food)
                      (loop food 'rabbit))))))
        (lambda (bob)
          (let ((me die)
                (in peace))
              (plea-for-help me in))))
      (cdr (assq fnargles (foo (bar baz (wobble))))))
...

Whereas the presence of record markers and an if:then:else: control structure makes things a lot nicer:

  [if: [or [foo-bar-baz wibble]
          [fnargle [if: wibble
                    then: [foo wibble]
                    else: [bar wobble]]]]
   then:
         [make-fnargle: [foo [bar baz [wobble]]]
          mode:
            [bosh
              [bash bam bazzle]
              [wortlethorp wrangle]
              [wham [lambda [fish]
                      [let loop [[food [' cheese]]
                                 [catfood fish]]
                        [if: [much waffling]
                         then: [if: [eq? catfood [' potato]]
                                then: [loop [' yoghurt] catfood]
                                else: food]
                         else: [loop food [' rabbit]]]]]]]
          suicide-thunk:
            [lambda [bob]
              [let [[me die]
                    [in peace]]
                  [plea-for-help me in]]]]
   else: [cdr [assq fnargles [foo [bar baz [wobble]]]]]]

But what about things like images and databases and so on?

Not all files and other bits of serialised data have an obvious tree structure. What about pixmapped images, and other bulk numeric data? What about files that are modified in-place or randomly accessed, such as databases? These are all things that s-expressions, XML, and so on are notoriously inappropriate for.

Well, the presence of the homogenous vector types sorts out the images and bulk numerics; where in XML you have to resort to base64 or complex packaging systems where the XML file and its 'attachments' are bundled into some kind of package, we can just put such data in a vector. Arbitrary byte streams from the world of Posix and the Internet can be stuffed into a #u8<...>. Images can be written into triples of #u8<...> #u16<...> or even #float<...> vectors (those triples being L, u and v* channels, of course!).

Interestingly, I'm suggesting that the images would be stored as plain pixel samples, rather than compressed in some way. I think that lossless compression of two-dimensional numeric arrays shouldn't be specific to image file formats; why not define the binary encoding for IRON to have (at the encoder's option) the ability to statistically analyse vectors, and to choose from a range of different encodings for them? The binary IRON encoding will handle dictionary and entropy compression at a lower level, but there could be different ways of writing a numeric vector (such as using predictive differential encodings in order to enhance their compressability at the lower levels.

And as for the random-access files - that's simply beyond our scope. Higher-level mechanisms within TUNGSTEN will handle that for us, by placing IRON values into B+ trees for us!

Conclusion

s-expressions are already pretty close to what I wanted, in terms of features; but they don't have a binary encoding, so aren't great for storing the likes of pixmap images or arbitrary binary blobs. And the syntax can be problematic in some ways, that records fix.

I'm looking forward to finding an excuse to implement IRON, at least as a prototype, on top of existing programming environments. Certainly, such implementations will be useful in their own right, just as JSON and friends are; but until I have a need for such a thing, I can't really spare the energy to make one right now! But my reason for wishing to do so is that things like notations tend to be hard to analyse 'in the lab' - you need to try expressing some real-world data in them, and using them on a daily basis, to really learn the pitfalls!

But I'm hoping that, with a little experimentation and polishing, I can make IRON a good written format for all sorts of machine-oriented information - that's interoperable with an efficient binary format.

3 Comments

  • By Tom Novelli, Sun 2nd Aug 2009 @ 9:48 pm

    You've put a great deal of effort into documenting the myriad parts comprising a real-world global computing environment, and how they all fit together.

    You're just using the chemical names as placeholders until you work out the boundaries between parts and which parts are actually needed... right? 🙂 I'd find it a lot easier to follow along if the pieces were referenced something like this:

    H hardware abstraction; He executive; Au GC~~; Fe data model; Cr HLL; Li dispatcher; Cs job scheduler; Ir transport; Hg RPC~~; Wolfram cluster communication; W transactional distributed store; C naming; I identification; Ne UI toolkit; Fl gateway;

    Also, I think you'll need a few more parts... O global/galactic/universal data store; Pu FFI 🙂

  • By alaric, Mon 3rd Aug 2009 @ 8:59 am

    You've put a great deal of effort into documenting the myriad parts comprising a real-world global computing environment, and how they all fit together.

    In other words, I'm mad 😉

    You're just using the chemical names as placeholders until you work out the boundaries between parts and which parts are actually needed... right? 🙂 I'd find it a lot easier to follow along if the pieces were referenced something like this:

    I dunno really; I've not given too much thought to the longer-term naming plan!

    The story of the theme is this: I was sitting there wondering what a good name for the project would be, when a big gas tanker lorry drove past, with "ARGON" written on it, and I thought it sounded good.

    And then just as Java wrung the 'coffee' theme dry, I decided to do the same with the periodic table 😉

    Each name has some relevance to the function it performs. Hydrogen is really the building-block from which all other atoms are made (plus a few neutrons to hold it together); Helium is just a small thing that builds on Hydrogen. Gold has connotations with quota management, as it's used as a token to represent value. Iron is a bit more tenuous, but it's sort of a strong, stable, framework to tie large structures together with, and there's the possibility of a rather bad pun on "an ion" being a serialised Iron value (analogous to Python's pickles). I can't remember what made me choose Chrome for a language, probably just some metaphor of putting chrome-plating on top of the low-level hydrogen VM to make it easier to program. Lithium is the framework for reacting to events, that drives the high-level behaviour of the whole thing, since Lithium is a very reactive metal. Caesium from its use in atomic clocks, thus relating to scheduling. Iridium arose because it's between Mercury and Tungsten in the periodic table, and it's the common functoinality shared between the two; Tungsten has connotations of strength and inviolability, as it's alloyed into steels to make them tougher and has a very high melting point, so it seemed a good name for a replicated data storage system; while wolfram (even though it denotes the same element) is a nod to the Bible of clustering, "In search of clusters", which uses a theme of packs of wolves versus the multi-headed dog Kerberos to explore SMP vs. clusters, and various cluster-related software packages since have made homages to it. Carbon as the global knowledge base, as carbon is the structural skeleton of most large molecules, with other atoms hanging off of the side; so the idea of a tree structure made of Carbon with bits of Iron hanging off of the leaves appeals! Iodine started off as a set of standard interfaces for people (and person-like objects) within the system, so was indeed originally about things like identification (and the chemical relevance being purely that the symbol for Iodine is "I"), but as I tinkered with it, I realised that the set of properties that people share with other objects such as assorted bits of hardware and software (now there's a cynical statement...) was quite large, so it was becoming increasingly unrealistic! Neon for UI stuff is simply a reference to neon signs being brightly coloured and attracting the attention of people to tell them stuff, and Fluorine for gatewaying to the outside world because Fluorine has a very electronegative ion that binds aggressively to other atoms.

    But yeah, I should make the naming clearer. I can modify my WordPress plugin that puts the links in to do that, actually... in the text, I just write HELIUM and it does the rest (similarly, all the Wikipedia links are C (programming language) et al). Good point, thanks!

    Also, I think you'll need a few more parts... O global/galactic/universal data store;

    C will do that. Names are just one kind of data, that happens to string the other kinds together into a meaningful whole.

    Oxygen implies a kind of live-giving reactivity... perhaps Oxygen can be the name of a drive to actually implement things, rather than a software component!

    Pu FFI 🙂

    I was thinking radioactives like 'Pu' would be the unstable release candidate series 😉

  • By Tom Novelli, Fri 7th Aug 2009 @ 9:14 pm

    Glad I asked... makes more sense now. That explanation would be a nice addition to your ARGON Implementation Roadmap.

    I've been thinking about OS design a lot lately, and typing up notes in a CouchDB wiki-type-thing I'm fooling with. If we ever get around to it, it might be good to combine our materials in a distributed knowledge base to get a feel for how such things work in practice. (We could've used that in the Tunes project about 5 years ago 🙂

Other Links to this Post

RSS feed for comments on this post.

Leave a comment

WordPress Themes

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales
Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales