HYDROGEN: Bootstrapping

As promised, here is the third part of my series on HYDROGEN, where I will discuss bootstrapping.

Introduction

The way the application written on top of HYDROGEN actually gets to run in the first place can vary widely between different platforms. Let's take a look at some different classes of boot process I've designed HYDROGEN to be able to deal with.

POSIX

When running as a POSIX process, the boot process is simple; there will be a HYDROGEN runtime as a native executable file, compiled from the HYDROGEN runtime's C source code. This will be invoked and passed a path to a virtual machine image directory on its command line, which will contain:

  • A configuration file with virtual hardware details: which raw disk devices or files to make available as mass storage devices, how to provide the HYDROGEN console device (via stdin/stdout, via a TCP socket you can telnet to, via X11, not at all), whether to go into the background and run as a daemon, and so on.
  • An optional configuration file with runtime tuning parameters required by the implementation such as details of how much memory to allocate, how many POSIX threads to start as virtual processors if the default isn't good, etc.
  • The HYDROGEN application itself, as HYDROGEN source code to feed to the interpreter
  • A bootstrap configuration file for the application (discussed later)

So the runtime will load the configuration files, set itself up, then interpret the HYDROGEN application which will cause subroutines to be compiled into RAM and data structures to be created, then the application starts to run, and we're away.

The bootstrap process defines a definite boundary between the application's HYDROGEN source code being loaded and interpreted, and the actual running of the application; the result of interpreting the application's source code is purely to set up the initial memory image. The application must define a word called start, which HYDROGEN actually invokes to run the application. Requiring that control pass back to the HYDROGEN runtime between these two phases serves several purposes.

For a start, it simplifies multiprocessor initialisation, since the initial load phase happens on a single processor. Once it is complete, any other processors may then be fired up, and all of them can run the start word; there is a device:cpu:master word that pushes a boolean onto the stack, which is true if the code is running on an arbitrary but unique "master" CPU, and which can be used to elect a CPU to initiate whatever the application requires. It's guaranteed that precisely one CPU is master, so an inherently single-processor application can just start by halting the CPU if device:cpu:master is false, while multi-processor applications can make the non-master processors wait for the master to assign them tasks.
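To make the election pattern concrete, here's a minimal sketch in Python rather than HYDROGEN, with threads standing in for processors. Only start and device:cpu:master come from the description above; the names boot and is_master, and the choice of CPU 0 as master, are assumptions made purely for illustration.

```python
import queue
import threading

def boot(num_cpus, tasks):
    """Simulate the run phase: every virtual CPU calls the same start()."""
    work = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def start(cpu_id):
        # Stand-in for device:cpu:master; CPU 0 is picked arbitrarily here.
        is_master = (cpu_id == 0)
        if is_master:
            # The master hands out tasks, then one stop marker per worker.
            for task in tasks:
                work.put(task)
            for _ in range(num_cpus - 1):
                work.put(None)
        else:
            # Non-master CPUs wait for the master to assign them work.
            while True:
                task = work.get()
                if task is None:
                    break
                with results_lock:
                    results.append((cpu_id, task * task))

    cpus = [threading.Thread(target=start, args=(i,)) for i in range(num_cpus)]
    for cpu in cpus:
        cpu.start()
    for cpu in cpus:
        cpu.join()
    return sorted(value for _, value in results)

squares = boot(num_cpus=4, tasks=[1, 2, 3, 4, 5])
```

Every CPU runs identical code, and the single boolean is the only thing that breaks the symmetry. A single-processor application would simply return early in the non-master branch.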

Secondly, it allows an optimisation. Interpreting the application on every boot can be costly. It's perfectly legal for the HYDROGEN runtime, after the load is complete but before invoking start, to snapshot the resulting memory image to persistent storage. On subsequent boots, if the HYDROGEN source for the application has not changed, then the pre-generated image can be loaded immediately. In order to support this, it's specified that the results of calling words that query potentially variable aspects of the hardware are undefined while the application is being interpreted; the application can check for the potential presence of certain kinds of hardware by using the require and feature? words, and use that to guide conditional compilation of code using those features, but it shouldn't ask how many network interfaces or mass storage devices or whatever there actually are until start is invoked.
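One way a runtime might decide whether the source "has not changed" is to key the snapshot on a hash of the source text. The sketch below is in Python and is only an assumption about how this could be done: interpret, load_image, the ".image" file suffix and the use of SHA-256 are all hypothetical, and the source strings are placeholders rather than real HYDROGEN code.

```python
import hashlib
import os
import pickle
import tempfile

def interpret(source):
    """Hypothetical load phase: build an initial memory image from source."""
    # A real runtime would compile words; here the image is just a dict.
    tokens = source.split()
    return {"words": tokens, "has_start": "start" in tokens}

def load_image(source, cache_dir):
    """Return (image, from_cache), reusing a snapshot if the source is unchanged."""
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    snapshot_path = os.path.join(cache_dir, digest + ".image")
    if os.path.exists(snapshot_path):
        # Same source as a previous boot: skip interpretation entirely.
        with open(snapshot_path, "rb") as f:
            return pickle.load(f), True
    image = interpret(source)
    # Snapshot after the load completes but before start would be invoked.
    with open(snapshot_path, "wb") as f:
        pickle.dump(image, f)
    return image, False

cache_dir = tempfile.mkdtemp()
first, first_cached = load_image(": start ;", cache_dir)
second, second_cached = load_image(": start ;", cache_dir)
third, third_cached = load_image(": start 1 ;", cache_dir)
```

The snapshot is only valid because nothing in the load phase was allowed to depend on the actual hardware present, which is exactly why those queries are undefined until start runs.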

In particular, the device-tree system will not generate any notifications of the addition or removal of hardware until start is run. At that point, there will be an initial notification of new device insertion listing all of the initial hardware, with subsequent notifications thereafter as hardware is added or removed.
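A small Python sketch of that behaviour, under the assumption that the device tree simply holds back events until the run phase begins. The DeviceTree class, its method names and the device names are hypothetical and not part of HYDROGEN.

```python
class DeviceTree:
    """Hypothetical device tree that stays silent until start is run."""

    def __init__(self):
        self._devices = []
        self._listener = None

    def attach(self, name):
        self._devices.append(name)
        # Before start, hardware is recorded but nobody is notified.
        if self._listener is not None:
            self._listener("added", [name])

    def detach(self, name):
        self._devices.remove(name)
        if self._listener is not None:
            self._listener("removed", [name])

    def run_start(self, listener):
        # The first notification lists all of the initial hardware at once.
        self._listener = listener
        listener("added", list(self._devices))

events = []
tree = DeviceTree()
tree.attach("disk0")          # present during the load phase: no event yet
tree.attach("net0")
tree.run_start(lambda kind, names: events.append((kind, names)))
tree.attach("disk1")          # hot-plugged after start: notified individually
tree.detach("net0")
```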

Self-hosting bare metal

A self-hosted bare metal system is one that's large enough to contain its own compiler, and to compile the application into RAM on startup. The HYDROGEN runtime system will, depending on the platform, be burned into a ROM of some kind for direct execution in-place, or be on an external mass storage device that is loaded by a bootstrap process (which may involve multiple stages). So it will be loaded and gain control via the platform-dependent process, then it will access some kind of persistent storage device (which might again be more direct-access ROM, or an external device) to load its configuration file documenting whatever aspects of the hardware are not hard-coded into the runtime image or able to be determined by automatic Plug-and-Play probes, the HYDROGEN application, and the HYDROGEN application's bootstrap configuration file. And, as before, after configuring itself, it will proceed to interpret the HYDROGEN application, resulting in the initial memory image of the application being built up in RAM, which it will then proceed to execute.

Interestingly, a bare-metal self-hosting machine need not store subroutines in the same memory space as normal data. In the discussion about code generation I explained that code generation involves creating a compiler context, feeding it VM operations, then sealing it to get a "handle" through which the subroutine can be invoked, or calls to it compiled into other subroutines. Then in the discussion about extensibility I discussed memory allocation, and in particular pointed out the heap stream interface for building variable-length objects; and the observant may have noticed that heap streams seemed a logical implementation choice for the compiler building up a native-code subroutine. However, at no point have I declared that subroutines must exist in addressable memory. Indeed, in the discussion of memory allocation, I mentioned that it might be impossible to place all of a computer's memory into a single address space, as the physical address size may be larger than the virtual address size; in such cases, it may be useful to have entirely separate code and data spaces, with subroutines being compiled into code space. In which case, a custom memory allocator for code space is required to handle allocating and freeing subroutines. But this can all be hidden behind the APIs presented, and applications need not be aware of this.

Not only can this increase the address space available, it can improve security, by making it harder for buffer overflows to inject shellcode.

Some platforms may even require this: Harvard architecture computers have independent physical memory units for code and data.
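As an illustration of how separate code and data spaces can hide behind opaque handles, here's a toy model in Python. It is a sketch under stated assumptions: the CodeSpace and CompilerContext classes, the three-operation instruction set and integer handles are all invented for the example, and are not the HYDROGEN API.

```python
class CodeSpace:
    """Hypothetical code store kept apart from ordinary data memory."""

    def __init__(self):
        self._subroutines = {}    # handle -> tuple of (op, arg) pairs
        self._next_handle = 0

    def seal(self, ops):
        # Allocate room in code space and hand back an opaque handle.
        handle = self._next_handle
        self._next_handle += 1
        self._subroutines[handle] = tuple(ops)
        return handle

    def free(self, handle):
        del self._subroutines[handle]

    def invoke(self, handle, stack):
        for op, arg in self._subroutines[handle]:
            if op == "push":
                stack.append(arg)
            elif op == "add":
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif op == "call":
                self.invoke(arg, stack)
            else:
                raise ValueError("unknown op: " + op)
        return stack

class CompilerContext:
    """Collects VM operations, then seals them into a subroutine."""

    def __init__(self, code_space):
        self._code_space = code_space
        self._ops = []

    def emit(self, op, arg=None):
        self._ops.append((op, arg))

    def seal(self):
        return self._code_space.seal(self._ops)

code = CodeSpace()
inner = CompilerContext(code)
inner.emit("push", 2)
inner.emit("push", 3)
inner.emit("add")
five = inner.seal()

outer = CompilerContext(code)
outer.emit("call", five)      # a call compiled in via the handle
outer.emit("push", 10)
outer.emit("add")
result = code.invoke(outer.seal(), [])
```

The application only ever holds handles, never addresses, so it cannot tell (and need not care) whether the subroutine bodies live in the same memory as its data stack.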

And we're away.

Tethered environments

Tethered environments are a little trickier, however. By a tethered environment, I mean a bare-metal system where there isn't the available RAM, ROM, or runtime for compiling an application written in HYDROGEN into native code on boot. So the platform depends on a 'master' machine to which it is 'tethered' (literally or figuratively).

Luckily, the fact that the initial interpretation of the application's HYDROGEN source code and its actual execution by calling start are isolated means we can handle this. In summary, we can interpret the source code in an emulated extended version of the platform on a host machine, and thus create a memory image that can be downloaded into the tethered platform - either directly by loading it into RAM or flashing it, or indirectly by producing a ROM image that is then burnt into potentially many systems.

This means that the actual runtime environment on the target platform can be trimmed down - it might not need the runtime code generator or interpreter, for example. The image builder on the host should keep track of which primitives are actually used in the resulting image, and trim down the runtime as required.

How the host image builder works is an implementation dependency, but for many platforms, it will probably be appropriate to directly build up the initial RAM map, as the pointers of objects will need to be hardcoded into things, but to build up subroutines in a more abstract form. For a start, some subroutines may only be needed during the initial interpretation, so they needn't end up in the resulting image; indeed, unless we have an emulator of the target's native code on the host, we'll want to store those subroutines in a form that can be easily interpreted. If they are stored as lists of basic VM ops, and handles are just indices into a table, we can produce the final image by finding the start word in the wordlist, then tracing recursively through the subroutines it references until we get to primitives, compiling them to the target's ROM image as we go.
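The tracing step can be sketched as follows, in Python. This assumes the representation suggested above (subroutines as lists of basic VM ops, handles as table indices); the op names "call" and "prim", the function build_image and the example words are hypothetical.

```python
def build_image(wordlist, subroutines, primitives):
    """Trace from 'start'; return (handles to emit, primitives actually used)."""
    emitted = []              # handles, callees before their callers
    seen = set()
    used_primitives = set()

    def trace(handle):
        if handle in seen:    # already handled, and also breaks recursion cycles
            return
        seen.add(handle)
        for op, arg in subroutines[handle]:
            if op == "call":
                trace(arg)
            elif op == "prim":
                if arg not in primitives:
                    raise KeyError("unknown primitive: " + arg)
                used_primitives.add(arg)
        emitted.append(handle)

    trace(wordlist["start"])
    return emitted, used_primitives

# Handles are just indices into this table.
subroutines = [
    [("prim", "emit")],                      # 0: print-char
    [("call", 0), ("call", 0)],              # 1: greet
    [("prim", "compile")],                   # 2: helper used only at load time
    [("call", 1), ("prim", "halt")],         # 3: start
]
wordlist = {"print-char": 0, "greet": 1, "helper": 2, "start": 3}
primitives = {"emit", "halt", "compile"}

emitted, used = build_image(wordlist, subroutines, primitives)
```

Here the load-time helper (handle 2) never reaches the image, and since "compile" is not in the used set, the runtime shipped to the target could omit its code generator.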

The Boot Database

I've taken an unusual step in somewhat standardising the shape of the boot process, in order to make a standard interface for HYDROGEN applications to update themselves (and the HYDROGEN runtime beneath them), in a manner analogous to an operating system updating its boot loader and kernel. Almost every OS outside of the embedded world has a way of doing this, but they're usually relatively platform dependent.

We have words that tell us what platform we're on (x86_32-openboot), what implementation we're running (Harry'sHydrogen) and its version (2.8), and the identity of our system (such as a serial number); the platform identifier lets us know which implementations of HYDROGEN we can support. A HYDROGEN implementation image can then be bundled into a sequence of bytes, starting with a header that contains a platform identifier, an implementation identifier and a version number. If the platform and implementation identifiers match our own, then we can opt to install that implementation; and it's up to the implementation how that's handled. So if the implementation provides the boot-media feature, then the boot-media:install-kernel word can be used to install an implementation image.
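The header check might look something like this in Python. The byte layout is purely an assumption, since nothing above fixes one: three UTF-8 strings, each prefixed by a 16-bit big-endian length, followed by the payload. The function names are hypothetical too.

```python
import struct

def pack_field(text):
    data = text.encode("utf-8")
    return struct.pack(">H", len(data)) + data

def make_image(platform, implementation, version, payload):
    """Bundle an implementation image: header fields first, then the payload."""
    header = pack_field(platform) + pack_field(implementation) + pack_field(version)
    return header + payload

def read_header(image):
    """Return (platform, implementation, version, payload)."""
    fields = []
    offset = 0
    for _ in range(3):
        (length,) = struct.unpack_from(">H", image, offset)
        offset += 2
        fields.append(image[offset:offset + length].decode("utf-8"))
        offset += length
    platform, implementation, version = fields
    return platform, implementation, version, image[offset:]

def can_install(image, our_platform, our_implementation):
    # Only the identifiers must match; the version is allowed to differ.
    platform, implementation, _version, _payload = read_header(image)
    return platform == our_platform and implementation == our_implementation

image = make_image("x86_32-openboot", "Harry'sHydrogen", "2.8", b"\x00kernel")
ok = can_install(image, "x86_32-openboot", "Harry'sHydrogen")
wrong_platform = can_install(image, "POSIX", "Harry'sHydrogen")
```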

A POSIX-hosted implementation might be based around distributing a binary for the particular platform. In which case it might have platform identifiers such as "Linux-RedHat5.2-x86_32", and the HYDROGEN implementation image will contain a compressed ELF executable and the HYDROGEN source code for a standard library loaded before the application, which can then be installed. Or it might be based around compiling source, in which case the platform identifier might just be "POSIX", and the implementation image will be a tarball full of C source, which is compiled with the local C compiler and the result then installed.

A bare-metal implementation might have the image contain a boot sector, second-stage boot loader, native-code HYDROGEN kernel and HYDROGEN standard library, the first two parts of which would need installing into special locations on the disk, and the latter two parts of which could then be placed into a small boot partition on the disk.

Implementations that use FLASH might be able to re-flash themselves, and so on.

Likewise, implementations may require a system-specific configuration file, as well as their HYDROGEN runtime that will be shared between all systems running that implementation on that platform. The format of the configuration file is implementation dependent, but it ought to be some kind of plain text. And so the boot-media:install-kernel-configuration word can be used to install a new configuration file. Interestingly, the configuration need not just be stored verbatim in some persistent storage area; it could be parsed and different parts of it stored in different places. Some platforms might require that some configuration be placed into NVRAM as it's configuration for the platform boot ROM rather than for HYDROGEN itself.

But we also apply a similar process to the HYDROGEN application itself. Although the application can store its own state in mass storage devices, the way it organises the contents of storage devices is up to it; so how to load the application in the first place needs to be handled specially. I've mentioned previously that the HYDROGEN application consists of a single block of HYDROGEN source code that is interpreted, plus some configuration file (the distinction being maintained so that the same HYDROGEN application can be rolled to many machines, but then given slightly different configuration). And so we have boot-media:install-application-component, which takes a slot number and a text string to store in that slot number. Slot number 0 is the application, and 1 is its configuration file.

(Actually, there's no enforced distinction between application and configuration; it's just a series of strings identified by slot number. Loading the application consists of interpreting them in order. So the configuration has to be HYDROGEN source code, and the application might be split into multiple parts that can be updated independently, etc.)
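A minimal Python model of the slot mechanism, assuming slots are simply kept in a map and interpreted in ascending slot order. The BootMedia class and load_order method are hypothetical; only the slot numbering and the install word's name come from the text, and the slot contents are placeholders.

```python
class BootMedia:
    """Hypothetical stand-in for the application slots on the boot media."""

    def __init__(self):
        self._slots = {}

    def install_application_component(self, slot, text):
        # Installing into an occupied slot replaces just that component.
        self._slots[slot] = text

    def load_order(self):
        # Loading the application means interpreting the slots in order.
        return [self._slots[number] for number in sorted(self._slots)]

media = BootMedia()
media.install_application_component(1, "configuration source")
media.install_application_component(0, "application source v1")
before = media.load_order()

# Roll out a new application without touching the per-machine configuration.
media.install_application_component(0, "application source v2")
after = media.load_order()
```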

Initial installation

That's all well and good, if you're already running your HYDROGEN app so you can use the boot-media words. But how to install it in the first place? Well, that's platform dependent - although we know that the inputs to an initial installation are an implementation image, runtime configuration, and an application and its configuration, the way that initial state is taken to a new system varies. It might need to be bundled with an installer onto a USB key or bootable CD (or even a floppy or magnetic tape!), or it might be installed via netbooting, or through a special tether. So we don't try to specify how to do this!

Hopefully, different implementations that target the same platform will work in such a way that it's easy to migrate between them. If they have the same general convention for where they store their boot media data, even if the format is different, then changing implementation will just mean reinstalling the kernel, its configuration, and the parts of the application, potentially reformatting a boot partition into a new format; hopefully, the rest of a mass storage device won't need to be touched, meaning the new install will see the same mass storage partitions with the same contents.

Conclusion

In practice, these features are aimed at making it easier to manage a cluster of machines. If you have a database of some kind with a set of implementation images for your available platforms, a single application image, and for each machine, kernel and application configuration files, then it'll be easy to ask a machine what its platform and its serial number are, and thus know what files to install upon it. Automated means for rolling out new releases and configuration changes can thus be developed without getting mired in implementation details.
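The lookup such a database needs is small. Here's a sketch in Python, with plain dictionaries standing in for the database; plan_install, the dictionary keys, the serial numbers and the file contents are all hypothetical.

```python
def plan_install(platform, serial, implementation_images, application_image,
                 machine_configs):
    """Work out what to install on a machine from its platform and serial."""
    if platform not in implementation_images:
        raise LookupError("no implementation image for platform " + platform)
    if serial not in machine_configs:
        raise LookupError("no configuration for machine " + serial)
    config = machine_configs[serial]
    return {
        "kernel": implementation_images[platform],          # per platform
        "kernel_configuration": config["kernel"],           # per machine
        "application": application_image,                   # shared by all
        "application_configuration": config["application"], # per machine
    }

implementation_images = {
    "x86_32-openboot": b"bare-metal image",
    "POSIX": b"source tarball",
}
machine_configs = {
    "SN-0001": {"kernel": "console = serial", "application": "node-id 1"},
    "SN-0002": {"kernel": "console = tcp", "application": "node-id 2"},
}

plan = plan_install("POSIX", "SN-0002", implementation_images,
                    b"application source", machine_configs)
```

Each entry in the plan corresponds to one of the boot-media install operations described earlier, so the same plan can drive any implementation that provides them.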

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales