HYDROGEN: Code generation

The gory details

The HYDROGEN virtual machine is very simple. It has some memory, and a number of CPUs that can all access that memory. It's assumed that addresses are portable between CPUs; highly NUMA systems are probably best partitioned into a number of independent virtual nodes, with the memory interconnect used as a fast message-passing system. HYDROGEN assumes SMP within a virtual machine, as making efficient use of a NUMA system in a portable way is an ongoing research topic, while both SMP and message-passing parallelism are well understood.

Each CPU has a few virtual registers (an instruction pointer, an interrupt priority level, and two pointer registers), and two logical stacks - the data and return stacks.

It's possible that the data stack may be implemented as more than one physical stack, split by type; it's guaranteed that all the integer/pointer/character values on the data stack go on the same physical stack, but floating point values may go on a separate one. In practice, this means that code must not push a float and then perform an integer operation, or the results are undefined; that integer operation may read the internals of a float from the integer stack, or it may read whatever else is on top of the integer stack if the float is on a different stack.
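To make that concrete, here's a minimal sketch in C of such a split data stack - purely illustrative, with made-up names, not HYDROGEN's actual implementation:

    #include <stdint.h>

    #define STACK_DEPTH 256

    typedef struct {
        intptr_t cells[STACK_DEPTH];  /* integers, pointers, characters */
        int      cell_top;
        double   floats[STACK_DEPTH]; /* floating-point values          */
        int      float_top;
    } data_stack;

    static void push_cell(data_stack *s, intptr_t v)  { s->cells[s->cell_top++] = v; }
    static intptr_t pop_cell(data_stack *s)           { return s->cells[--s->cell_top]; }
    static void push_float(data_stack *s, double v)   { s->floats[s->float_top++] = v; }
    static double pop_float(data_stack *s)            { return s->floats[--s->float_top]; }

    /* Pushing a float and then performing an integer operation is undefined at
       the VM level: here, pop_cell() would return whatever happens to be on
       top of the cell stack, not the float that was just pushed. */
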

The return stack is simpler; not only is it guaranteed to be one stack (as it only supports a single data type), it's also guaranteed to live in memory, as you can use it to allocate temporary storage that you can then access via pointers. It maps pretty directly to a processor stack, while the data stack maps more directly to processor registers, with overspill to memory.
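A quick sketch of what "temporary storage on the return stack, accessed via pointers" can look like, assuming a byte-addressed return stack that grows downwards in ordinary memory (the names here are my own):

    #include <stdint.h>
    #include <string.h>

    static uint8_t return_stack[4096];
    static uint8_t *rsp = return_stack + sizeof return_stack;  /* grows down */

    static void *rs_alloc(size_t bytes) { rsp -= bytes; return rsp; }
    static void  rs_free(size_t bytes)  { rsp += bytes; }

    void example(void) {
        char *buf = rs_alloc(64);      /* temporary buffer on the return stack  */
        memcpy(buf, "scratch", 8);     /* ...usable through an ordinary pointer */
        rs_free(64);                   /* released before the subroutine returns */
    }
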

The place where this becomes complex is in stack manipulation, so we provide some tools to deal with that portably, which I'll cover later.

The operation of the virtual machine compiler is simple. A compilation context is created to compile a subroutine, and then operations can be compiled into that context. Only two types of operations are needed: pushing literal values onto the data stack, and calling other subroutines. When the subroutine is complete, the compilation context is sealed, and the handle of the new subroutine is returned.
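In C, that interface might look something like the following; the names (compile_begin, compile_push and so on) are mine for illustration, and the real thing deals in subroutine handles rather than raw C pointers:

    #include <stdint.h>

    typedef struct subroutine subroutine;     /* a compiled subroutine object */
    typedef struct compile_ctx compile_ctx;   /* an open compilation context  */

    compile_ctx *compile_begin(void);                       /* open a context         */
    void compile_push(compile_ctx *c, intptr_t literal);    /* compile "push literal" */
    void compile_call(compile_ctx *c, subroutine *target);  /* compile "call target"  */
    subroutine *compile_end(compile_ctx *c);                /* seal, return the handle */

    /* Compiling a subroutine that computes (x + 1), assuming we have been
       given the handle of the "+" primitive: */
    subroutine *make_increment(subroutine *prim_add) {
        compile_ctx *c = compile_begin();
        compile_push(c, 1);            /* push the literal 1          */
        compile_call(c, prim_add);     /* call the "+" primitive      */
        return compile_end(c);         /* seal and return the handle  */
    }
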

Control structures like loops and conditionals are done in good old lambda-calculus style: create a subroutine for the code inside your control structure, push its handle as a literal, then compile a call to a subroutine that implements the control structure, doing an indirect call to the pushed handle to invoke the code within. That indirect call is a matter of invoking a primitive subroutine that pops a handle from the data stack and calls the subroutine it points to. We can get away with such a trivial VM by having primitive subroutines that can do things that can't be implemented directly in the VM itself.
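Here's a sketch of how the indirect-call primitive and an "if" built on it could behave. Treating a handle as directly callable like a C function pointer is a simplification, and all the names are illustrative:

    #include <stdint.h>

    typedef void (*subr_fn)(void);

    /* A cell stack, as in the earlier sketch. */
    static intptr_t cells[256];
    static int top;
    static void push_cell(intptr_t v) { cells[top++] = v; }
    static intptr_t pop_cell(void)    { return cells[--top]; }

    /* The indirect call: pop a handle and call the subroutine it refers to. */
    static void prim_execute(void) {
        subr_fn target = (subr_fn)pop_cell();
        target();
    }

    /* "if": compiled code pushes the condition, then the handle of the body,
       then calls this primitive, which runs the body only if the condition
       is non-zero. */
    static void prim_if(void) {
        subr_fn body   = (subr_fn)pop_cell();
        intptr_t flag  = pop_cell();
        if (flag) body();
    }
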

A very naive implementation would use pointers to subroutine entry points as the handles, and would compile literal PUSH and CALL instructions. But we can do lots better than that! Instead, we can compile subroutines to little structures in memory, with a header that contains some flags and then pointers to both an optimised native code implementation of the subroutine, and a 'bytecode' form of it as a log of the pushes and calls that made it up. Primitive operations can be flagged in their headers, and the compiler can then spot calls to them and directly inline native code to perform the required operation. Calls to non-primitive subroutines can be inlined, too, either by relocating the already-compiled native code or (more flexibly) reading the 'bytecode' and issuing the recorded PUSHes and CALLs directly to the compiler - for our simple subroutine calling model allows inlining through direct textual substitution, thanks to the FORTHy stack model.
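The kind of subroutine object described above might be laid out roughly like this; the exact layout and field names are assumptions on my part:

    #include <stdint.h>
    #include <stddef.h>

    enum subr_flags {
        SUBR_PRIMITIVE = 1 << 0,   /* compiler may inline special-cased native code */
        SUBR_INLINABLE = 1 << 1,   /* safe to replay the op log into a caller       */
    };

    typedef enum { OP_PUSH, OP_CALL } opcode;

    typedef struct subroutine subroutine;

    typedef struct {
        opcode op;
        union {
            intptr_t    literal;   /* for OP_PUSH */
            subroutine *target;    /* for OP_CALL */
        } arg;
    } vm_op;

    struct subroutine {
        uint32_t  flags;
        void     *native_code;   /* entry point of the optimised machine code   */
        size_t    op_count;      /* length of the "bytecode" log below          */
        vm_op    *ops;           /* the PUSHes and CALLs that made it up        */
    };
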

What about those control structures? Implementing every conditional or loop as an indirect call is a bit inefficient - so make them compiler primitives: if the last instruction was a literal PUSH, then a CALL to IF means that the literal just PUSHed must be the handle of the subroutine that's the body of the IF. So directly compile a conditional, and inline the body from the subroutine. But an IF that's not given a fixed subroutine can still use the indirect call, thus preserving the full dynamic power of our IF. And the same logic can apply to all the other control structures.

It's quite easy to build a peephole optimiser for FORTHy code: simply feed your primitives into the front of a FIFO of sufficient depth to detect the longest pattern you want to replace. When the FIFO becomes full, examine the back of the FIFO against a set of template patterns (such as "literal PUSH followed by a CALL to if"), each of which has its own special compilation logic. If none of them match, fall back on the simple implementation of the single operation at the back of the FIFO: push a literal, or call a primitive. Many templates will just rewrite the back of the FIFO and then repeat the optimisation process, so that optimisation rules can be combined. PUSH 0, PUSH x, CALL if represents a conditional that is never taken, so it can be optimised away entirely, while PUSH <non-0>, PUSH x, CALL if can be replaced by CALL x; and the literal constant pushed at the start of either rule might itself arise from a sequence like PUSH 0, PUSH n, CALL and, which we can replace by PUSH 0.
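As a sketch, here's a three-deep peephole FIFO implementing the two "if" rules above. The emit_op() back end and the prim_if identity are hypothetical stand-ins for whatever the real compiler provides, and at the end of a compilation the remaining ops would simply be flushed through emit_op():

    #include <stdint.h>
    #include <string.h>

    typedef enum { OP_PUSH, OP_CALL } opcode;
    typedef struct { opcode op; intptr_t literal; void *target; } vm_op;

    #define FIFO_DEPTH 3
    static vm_op fifo[FIFO_DEPTH];
    static int fifo_len;

    extern void emit_op(const vm_op *op);  /* naive code for a single op       */
    extern void *prim_if;                  /* handle of the "if" primitive     */

    static void emit_oldest(void) {        /* fall back: emit the op at the back */
        emit_op(&fifo[0]);
        memmove(&fifo[0], &fifo[1], (size_t)(--fifo_len) * sizeof(vm_op));
    }

    void peephole_feed(vm_op next) {
        fifo[fifo_len++] = next;
        while (fifo_len == FIFO_DEPTH) {
            if (fifo[0].op == OP_PUSH && fifo[1].op == OP_PUSH &&
                fifo[2].op == OP_CALL && fifo[2].target == prim_if) {
                if (fifo[0].literal == 0) {
                    fifo_len = 0;               /* never-taken conditional: drop it  */
                } else {
                    fifo[0].op = OP_CALL;       /* always-taken: rewrite as CALL x   */
                    fifo[0].target = (void *)fifo[1].literal;
                    fifo_len = 1;               /* rules get re-run as more ops arrive */
                }
            } else {
                emit_oldest();                  /* no template matched: emit one op  */
            }
        }
    }
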

And the implementation needn't be strictly stack-focussed, either. Physical CPUs with lots of registers can benefit from having a virtual stack mapping stored in the compilation context. The subroutine might start with an agreed calling convention that, perhaps, the top two or three stack elements are in certain registers, and the rest on the physical stack, with the remaining registers marked as free in the initial mapping; but when a PUSH is compiled, you can just pick a free register, output code to load the constant into that register, and update the mapping to record that the top of stack is now in that register, and all other registers caching stack contents have their indices into the stack shifted down by one. Primitives such as + can then spot that the top two stack elements are already in registers, and just generate code to add one onto the other, and then update the stack mappings to reflect the new state of the stack.
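Here's a toy model of that stack-to-register mapping; it prints pseudo-assembly rather than generating real code, assumes the top two slots are in registers when + is compiled, and all the structure and names are invented for the example:

    #include <stdio.h>
    #include <string.h>

    #define NREGS 4
    #define MAP_DEPTH 16

    typedef struct { int in_register; int reg; } slot;

    typedef struct {
        slot map[MAP_DEPTH];   /* map[0] describes the top of stack */
        int  depth;
        int  reg_free[NREGS];
    } stack_map;

    static void stack_map_init(stack_map *m) {
        memset(m, 0, sizeof *m);
        for (int r = 0; r < NREGS; r++) m->reg_free[r] = 1;
    }

    static int claim_register(stack_map *m) {
        for (int r = 0; r < NREGS; r++)
            if (m->reg_free[r]) { m->reg_free[r] = 0; return r; }
        return -1;   /* none free: a real compiler would spill to memory here */
    }

    /* Compiling "PUSH n": load the constant into a free register and record
       that it is now the top of stack; existing slots move down one. */
    static void compile_push(stack_map *m, long n) {
        int r = claim_register(m);
        for (int i = m->depth; i > 0; i--)
            m->map[i] = m->map[i - 1];
        m->map[0].in_register = 1;
        m->map[0].reg = r;
        m->depth++;
        printf("  MOV r%d, #%ld\n", r, n);
    }

    /* Compiling "+" when the top two slots are both in registers: add one
       onto the other and make the result the new top of stack. */
    static void compile_add(stack_map *m) {
        int dst = m->map[1].reg, src = m->map[0].reg;
        printf("  ADD r%d, r%d\n", dst, src);
        m->reg_free[src] = 1;                /* the source register becomes free */
        for (int i = 1; i < m->depth; i++)
            m->map[i - 1] = m->map[i];
        m->depth--;
    }

    int main(void) {
        stack_map m;
        stack_map_init(&m);
        compile_push(&m, 2);   /* MOV r0, #2                              */
        compile_push(&m, 3);   /* MOV r1, #3                              */
        compile_add(&m);       /* ADD r0, r1 - r0 is now the top of stack */
    }
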

Of course, the sequence PUSH n; CALL + would never get through the peephole FIFO intact - it would be compiled straight to ADD <TOS>, n where <TOS> is the register holding the top of the stack.

In fact, by storing the sequence of basic VM ops in each compiled subroutine object, you can do arbitrary whole-program optimisations: at the topmost level, when the "main subroutine" for a program is compiled, you can follow the chains of subroutine calls - the CALLs inside it to other subroutines, all the way down to primitives - so the entire program can be accessed and reasoned about.
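For illustration, a walk over those stored op logs might look like this, reusing the invented types from the subroutine-object sketch above; a real analysis would also need to remember which subroutines it has already visited, to cope with recursive calls:

    #include <stddef.h>
    #include <stdint.h>

    typedef enum { OP_PUSH, OP_CALL } opcode;
    typedef struct subroutine subroutine;
    typedef struct { opcode op; intptr_t literal; subroutine *target; } vm_op;

    struct subroutine {
        int     is_primitive;   /* primitives have native code, not an op log */
        size_t  op_count;
        vm_op  *ops;            /* the recorded PUSHes and CALLs              */
    };

    /* Apply an analysis callback to every subroutine reachable from the
       program's main subroutine. */
    void walk_program(subroutine *s, void (*visit)(subroutine *)) {
        visit(s);
        if (s->is_primitive)
            return;
        for (size_t i = 0; i < s->op_count; i++)
            if (s->ops[i].op == OP_CALL)
                walk_program(s->ops[i].target, visit);
    }
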

So that's the core VM; but what of the actual language built on top? Well, the interpreter is very simple, and like FORTH, it's built around the concept of a dictionary. The interpreter has a context, just like the compiler; this context contains some basic parsing state, and the head of a linked list of "word definitions". The interpreter is given a source of characters, from some input device; it starts by ignoring characters until it finds a non-whitespace character, then it reads characters into a buffer until it finds a whitespace character, thus parsing a single word from the input stream.

It then checks to see if the word matches the syntax for a literal, and if so, it pushes the literal onto the data stack (the interpreter carefully makes sure that, although it may use the stacks as it parses a word and decides how to handle it, it keeps all of its context state elsewhere when it's actually performing the action of a word, so the stacks are available for application use). If not, it iterates down the linked list of word definitions from the head pointer in its context until it finds one whose name field matches the name in its buffer; if it doesn't find one, then it calls an error handler specified in its context. If it does, then it examines the contents of the word header to find out what to do next; inside the header are, amongst other things, two subroutine handles called parse and run. The interpreter pushes the pointer to its context onto the data stack then calls parse, which can use the interpreter context to read characters from the input stream to parse any special syntax, and push the results onto the data stack. The run handler is invoked next, which should perform the actual action of the word.
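A sketch of those structures in C - the field names are my guesses at the spirit of the design rather than HYDROGEN's actual layout, and I've included the compile handle that comes up a little further down:

    #include <stddef.h>
    #include <string.h>

    typedef struct subroutine subroutine;   /* handle to a compiled subroutine */

    typedef struct word {
        struct word *next;       /* next definition in the linked list            */
        const char  *name;       /* compared against the word read from the input */
        subroutine  *parse;      /* reads any special syntax from the input       */
        subroutine  *run;        /* performs the word's action                    */
        subroutine  *compile;    /* emits code for the word when compiling        */
    } word;

    typedef struct {
        word *wordlist;                    /* head of the definition list      */
        int (*next_char)(void *io);        /* the source of characters         */
        void *io;
        void (*error)(const char *name);   /* called when a word isn't found   */
        /* ...plus a word buffer and other parsing state...                    */
    } interp_ctx;

    /* The lookup the interpreter performs once it has parsed a word that
       doesn't look like a literal; if this returns NULL, ctx->error() is
       invoked. */
    word *find_word(interp_ctx *ctx, const char *name) {
        for (word *w = ctx->wordlist; w != NULL; w = w->next)
            if (strcmp(w->name, name) == 0)
                return w;
        return NULL;
    }
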

A whole bunch of standard words is provided as a library. Most of them have a parse subroutine that just discards the interpreter context pointer and pushes nothing, as they need no special parsing; words like + just have a run action that points to the + primitive subroutine.

However, one very interesting word is (, which invokes the compiler. Its parse subroutine creates a compilation context, then proceeds to read the interpreter's input stream in much the same manner as the interpreter, but rather than performing the actions of words, it does something different. If it finds a literal, then rather than pushing it on the stack, it compiles a literal push. If it finds a word in the wordlist, then it pushes the interpreter context and calls the word's parse subroutine to make it parse any extra syntax it needs; then it pushes a pointer to the word definition and a pointer to the compilation context and calls the word's compile subroutine, which for most words just compiles a call to the word's run subroutine. If it encounters the special word ), then it seals the compilation context, and pushes the resulting subroutine handle onto the stack in place of the interpreter context pointer. In other words, when ( ... ) is parsed, the code between the parentheses is compiled, and the resulting handle pushed; the run subroutine of ( is a no-op, so the effect of ( ... ) in the interpreter is to compile a subroutine and push its handle onto the stack. So ( ... ) if (with the word if bound in the standard wordlist to the if primitive discussed above) will conditionally execute the code in the parentheses.

But what happens if we encounter ( when compiling? Well, ( has a different compile subroutine to the default discussed above. Bear in mind that, for each word, the compiler calls parse then compile, and the parse subroutine of ( leaves the handle pushed onto the stack. The compile subroutine of ( reads the handle that parse pushed, and compiles a literal push of it.

So the effect of ( ... ) in the compiler is to compile the code between the parentheses into a subroutine, and then to compile code into the "current" subroutine that just pushes the address of the new subroutine.

This means that ( ... ) if, in compile mode, with the peephole optimisations discussed above enabled, will result in the code in the parentheses being compiled with logic around it to jump over it if the condition is false.

This means that, in practice, the interpreter won't actually be used all that much. Certainly when loading software, the interpreter will generally just execute a long sequence of calls to the compiler via (, each followed by a word that turns the newly compiled subroutine into a new word. There's no way to create a subroutine that is, itself, interpreted (we could create one, but why bother?) - all subroutines are compiled, so the interpreter just goes through the source code of an application from top to bottom, invoking things once, and the end result will usually be a load of new words defined into a wordlist, then a top-level word invoked that actually 'runs' the application; the interpreter will only do much "real work" when being used by a human as a command prompt for low-level debugging.

Note the elegant separation of concerns: by having a very, very simple VM (with just PUSH and CALL instructions), we have been able to define a very simple interpreter and a very simple compiler; yet the VM design allows well-optimised code to be generated from it, and issues like forwards and backwards jumps, the scope of definitions, and so on are no concern whatsoever of the compiler or interpreter.

And, yes, you can have scopes; it's possible for parse subroutines of words to modify the wordlist head pointer in the interpreter context, to add extra definitions. Lexical scoping is enforced, as every such defining word (except at the 'top level') must be bracketed by a corresponding end-of-scope word that removes the definition (and any other state) and checks that the top of the wordlist is what it expects, to ensure that begin and end words have been correctly balanced. We'll see an example of that below when we consider local variables.

The main means of adding new words is through a small set of defining words. The most common one is :, whose parse subroutine reads a single word from the input to be the name of the new word; its run action is to read a subroutine handle from the stack and to create a word with that name, the default parse and compile actions, and the handle from the stack as the run action; :'s compile subroutine just raises an error, as in compiled code there is no implicit interpreter context to add new words to.
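As a rough sketch of :'s run action in terms of the interpreter structures above - and this is a guess at the plumbing, assuming its parse action leaves a pointer to the parsed name on the stack above the subroutine handle:

    #include <stdlib.h>

    typedef struct subroutine subroutine;
    typedef struct word {
        struct word *next;
        const char  *name;
        subroutine  *parse, *run, *compile;
    } word;

    extern subroutine *default_parse, *default_compile;  /* standard handlers  */
    extern void *pop_ptr(void);                           /* data stack helper  */

    /* Create a new word whose run action is the compiled subroutine, and
       prepend it to the wordlist; returns the new head of the list. */
    word *colon_run(word *wordlist_head) {
        const char *name = pop_ptr();    /* pushed by ":"'s parse action     */
        subroutine *body = pop_ptr();    /* the ( ... ) compiled beforehand  */
        word *w = malloc(sizeof *w);
        w->name    = name;
        w->parse   = default_parse;      /* no special parsing               */
        w->compile = default_compile;    /* just compiles a call to run      */
        w->run     = body;
        w->next    = wordlist_head;
        return w;
    }
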

This means that you can feed the interpreter code of the form ( ... ) : foo, which will create a new word called foo that performs the specified subroutine. To define words with non-standard parse and compile actions, one instead uses define foo, which works much like : but expects three handlers on the stack - one each for parse, run, and compile.


3 Comments

  • By Gavan, Thu 16th Jul 2009 @ 11:53 am

    Do you have any mechanism for assuring that a certain bit of code which will be used later must be compiled by a certain point in the code?

    I can think of several cases (mostly within device drivers) where the latency of certain routines is critically important, and having to wait (even the first time) for the parser to do its thing could lead to lots of unpleasantness.

  • By alaric, Thu 16th Jul 2009 @ 12:35 pm

    The definition of a subroutine with ( ... ) compiles it there and then, in the implementations I have planned (except for the case of tethered systems, but you'll have to wait to hear about them). Either way, by the time you call a subroutine, it ought to be compiled.

    It's all up to the implementation, though - weird JIT stuff could be done. It's just that those implementations would suck for real time stuff, and should say so on the tin 🙂

  • By alaric, Fri 17th Jul 2009 @ 2:07 pm

    This is also interesting reading:

    http://factor-language.blogspot.com/2009/07/improved-value-numbering-branch.html
