HYDROGEN: Code generation

As promised, here is the first part of my series on HYDROGEN, where I will discuss code generation.

Introduction

Most operating system portability layers use a batch compiler - almost always gcc - to convert the OS kernel's source code to a platform-specific native code file that is installed somewhere a boot process can get at it. This works, but batch compilation is limiting: it precludes all sorts of interesting run-time code generation tricks, and nobody likes compiling kernels. In large clusters particularly, you need infrastructure to manage compiling kernels for all your platforms and to make sure each machine gets the correct kernel.

Obviously, given that existing hardware boot processes expect some native code, there will always be some element of precompilation and native code management, but we can help a bit by pushing it down as low as we can. The core of HYDROGEN is that the HAL itself, as part of its management of the bootstrap process (to be covered in more detail next), includes a runtime code generator.

This makes things a bit simpler by putting more of the platform-specific stuff in one place: the same body of code that handles the bootstrap and system-initialisation process also contains the platform-specific code generator, in one 'lump', which reduces the management overhead somewhat. As I shall describe in the next post, a system of platform identifiers will prevent a HYDROGEN implementation that's incompatible with the platform from being installed anyway.

Once the HYDROGEN kernel is loaded, the actual OS can be written in HYDROGEN portable code, and the process of loading it will compile it to native code in RAM for actual execution.

The HYDROGEN compiler is then made available as a runtime service, so that kernel modules such as device drivers, and application code too, can be loaded dynamically. Nothing above that layer need be platform dependent (but it still can be; HYDROGEN permits the provision of platform-specific interfaces, such as assemblers and low-level device driver access).

However, we don't want the full weight of something like gcc as part of our bootstrap. HYDROGEN's approach is more like Tao's Virtual Processor: a lightweight virtual machine implementation takes code written in a low-level representation, designed for easy and efficient compilation, and either compiles it to native code or interprets it. Unlike many VMs, though, we do not just provide a bytecode format. We use a low-level model equivalent to the sort of virtual machine model a bytecode exposes, but rather than defining a format for describing a series of virtual machine instructions, we define a simple interpreter on top of the model. That interpreter takes source code written in human-readable text, and we provide the interpreted language with primitives to compile virtual machine operations. These compilation primitives may generate native code, or may actually generate a bytecode in some format that is then interpreted - but that's an implementation detail.
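
To make the shape of that concrete, here is a minimal sketch in Python. It is not HYDROGEN's actual language or API: the primitive names (emit-push, emit-add, emit-mul), the tuple-based operation format and the run back end are all invented for illustration. The point is only that the source text is a program that is interpreted, and that the program's output is the code to be executed.

```python
# Hypothetical sketch: every name here is an assumption, not part of HYDROGEN.

def interpret(source, code):
    """Interpret human-readable source; its compile primitives append to `code`."""
    stack = []  # the interpreter's own data stack, used at code-generation time
    primitives = {
        # Each primitive "compiles" one virtual machine operation.
        "emit-push": lambda: code.append(("push", stack.pop())),
        "emit-add": lambda: code.append(("add", None)),
        "emit-mul": lambda: code.append(("mul", None)),
    }
    for token in source.split():
        if token in primitives:
            primitives[token]()
        else:
            stack.append(int(token))  # anything else is an integer literal
    return code

def run(code):
    """Stand-in back end: a real one might emit native code instead."""
    stack = []
    for op, arg in code:
        if op == "push":
            stack.append(arg)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack

program = interpret("2 emit-push 3 emit-push emit-add 4 emit-push emit-mul", [])
result = run(program)  # evaluates (2 + 3) * 4 on the generated operations
```

Here the generated operations happen to be interpreted, but nothing in the source text depends on that; swapping run for a native code generator would not change the program being interpreted.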

The strengths of this approach are many:

  1. The source code is readable by humans, rather than being an inscrutable bytecode format that, again, requires external batch compilation.
  2. The interpreter may be made available from the console, as a means of debugging boot problems. If the kernel won't load, the HYDROGEN bootstrap process can drop the user to a command line to try and recover the situation. This means it can also function as a kernel debugger.
  3. As we are running a program that uses an API to generate code, rather than just reading a static description of the code to generate from a file, we can metaprogram, building reusable code-generation tools in the interpreted language so the programmer is not forced to deal with the low-level virtual machine. In fact, unless the programmer is doing some particularly low-level metaprogramming themselves, they will never directly generate VM operations, because the first thing the HYDROGEN standard libraries do is define a compiler from the interpreted language to the VM, so that code to be interpreted immediately and code to be compiled can be written in exactly the same language.

The last point there is inherited straight from FORTH, where the interpreter and the compiler are intimately intertwined; I separate them slightly. Another property HYDROGEN inherits from FORTH is that the virtual machine is based around stack manipulation, which makes certain things a lot simpler!
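
For readers who have not met FORTH, the following Python sketch shows the general idea of one language serving for both immediate interpretation and compilation on a stack machine. It uses classic FORTH-style colon definitions (`: name ... ;`) purely as an assumption for illustration; it is not HYDROGEN's syntax, and the word names and the make_interpreter helper are hypothetical. "Compiling" here just means resolving a definition's body into a list of callables once, at definition time.

```python
# Hypothetical sketch of a FORTH-like interpreter; not HYDROGEN's real syntax.

def make_interpreter():
    stack = []
    words = {
        "dup": lambda: stack.append(stack[-1]),
        "*": lambda: stack.append(stack.pop() * stack.pop()),
        "+": lambda: stack.append(stack.pop() + stack.pop()),
    }

    def compile_body(tokens):
        """Resolve each token once, at definition time, into a callable."""
        ops = []
        for tok in tokens:
            if tok in words:
                ops.append(words[tok])
            else:
                # Bind the literal now so nothing is parsed at call time.
                ops.append(lambda n=int(tok): stack.append(n))

        def compiled():
            for op in ops:
                op()
        return compiled

    def evaluate(source):
        tokens = iter(source.split())
        for tok in tokens:
            if tok == ":":
                # Definition: collect the body up to ';' and compile it now.
                name = next(tokens)
                body = []
                for t in tokens:
                    if t == ";":
                        break
                    body.append(t)
                words[name] = compile_body(body)
            elif tok in words:
                words[tok]()            # immediate execution
            else:
                stack.append(int(tok))  # integer literal
        return stack

    return evaluate

evaluate = make_interpreter()
final_stack = evaluate(": square dup * ; 7 square 1 +")
```

The same tokens (dup, *, +) work both typed directly and inside a definition, and by the time square is called its body has already been resolved, so no parsing happens on the call path.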

3 Comments

  • By Gavan, Thu 16th Jul 2009 @ 11:53 am

    Do you have any mechanism for assuring that a certain bit of code which will be used later must be compiled by a certain point in the code?

    I can think of several cases (mostly within device drivers) where the latency of certain routines is critically important, and having to wait (even the first time) for the parser to do its thing could lead to lots of unpleasantness.

  • By alaric, Thu 16th Jul 2009 @ 12:35 pm

    The definition of a subroutine with ( ... ) compiles it there and then, in the implementations I have planned (except for the case of tethered systems, but you'll have to wait to hear about them). Either way, by the time you call a subroutine, it ought to be compiled.

    It's all up to the implementation, though - weird JIT stuff could be done. It's just that those implementations would suck for real time stuff, and should say so on the tin 🙂

  • By alaric, Fri 17th Jul 2009 @ 2:07 pm

    This is also interesting reading:

    http://factor-language.blogspot.com/2009/07/improved-value-numbering-branch.html

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales