Intermediate languages

One way of making a compiler portable to different host platforms is to make it compile to a processor-independent (or mainly processor-independent) low-level language.

GCC, perhaps the most widespread compiler of them all, uses this approach: it compiles via a mainly processor-independent internal form, "RTL" (register transfer language), which is close to assembly language in general but not aligned to any particular processor's dialect of it. I say mainly processor-independent because earlier stages in the compiler do consult target details and produce different RTL for different targets.

This is OK, but it does have a downside: you can't compile to RTL once, distribute the RTL to lots of people, and have them finish the compilation on their own machines. That would be handy for a few things, such as small target devices where the space consumed by the GCC front-end stages would be sorely missed (as would the time taken to run them). And it complicates the bootstrap process a bit.

On the other hand, another widespread example is the Java virtual machine, which is highly standardised. A Java compiler outputs .class files full of JVM code, and the Java runtime then compiles them upon loading, or just interprets them. The runtime is the only platform-dependent part.

Now, there are some disadvantages to this approach; sometimes the earlier stages of compilation can benefit from knowing a thing or two about the final architecture. In general, it's easier to optimise when you're going straight from a high-level language to assembly language, with every step along the way customised for the source language and target architecture.

However, this is a matter of choosing your intermediate language carefully, so that it sits at the best intermediate point between being too tied to a particular model of high-level language and being too tied to a particular model of assembly language, allowing the widest variety of each to be efficiently accommodated.

And there can be get-outs - if one includes some level of macro system in the intermediate language, then the front-end compiler can, in cases where it would like to know something about the target platform, produce code for several different options (usually there are only two or three choices), wrapped in appropriate conditionals so that only the appropriate one is finally used.
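
As a sketch of that get-out (everything here - the ILStream recorder, the instruction names, the feature flag - is invented for illustration, not any real system's format): the front end emits both candidate sequences, bracketed by an IL-level conditional that the back end resolves once it actually knows the target.

    class ILStream:
        """Toy recorder for intermediate-language instructions."""
        def __init__(self):
            self.code = []
        def op(self, name, *args):
            self.code.append((name, args))

    def emit_scale_by_8(out):
        # Not knowing the target, emit both variants, wrapped in a
        # conditional that the back end resolves at final codegen time.
        out.op("if_target", "HAS_BARREL_SHIFTER")
        out.op("shl", 3)            # one cheap shift, where the target has one
        out.op("else")
        out.op("add_self")          # x*2
        out.op("add_self")          # x*4
        out.op("add_self")          # x*8, by repeated doubling
        out.op("end_if")

    stream = ILStream()
    emit_scale_by_8(stream)
    print(stream.code)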

One thing that really helps is making the intermediate language more like a script that is run to generate assembly language, with a library of generic code-generation primitives (plus optional ones that not all platforms can or will implement, etc). Then the front end can emit scripts that make complex decisions at final code-generation time. FORTH is like this, and the model can be abstracted away from the stack nature of FORTH and used for other systems (although I've not seen this done yet).
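
A sketch of that script-like model: the "compiled program" below is itself a routine, shipped by the front end and run on the target at final code-generation time against the platform's library of primitives. The cg object and its methods are invented stand-ins, not any real system's API.

    def generate_dot_product(cg, n=4):
        # "cg" stands in for the target platform's code-generation library;
        # the script can use generic primitives and probe for optional ones.
        acc = cg.constant(0)
        for i in range(n):
            a = cg.load_element("xs", i)
            b = cg.load_element("ys", i)
            if cg.has_primitive("multiply_accumulate"):
                acc = cg.multiply_accumulate(acc, a, b)  # optional fused op
            else:
                acc = cg.add(acc, cg.multiply(a, b))     # generic fallback
        cg.emit_return(acc)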

Now, when designing an intermediate language, you tend to need to define an abstract model of the processor, of which different physical processors can be instances. For example, FORTH considers the CPU to be a pair of stacks, an instruction pointer register, and a mechanism that executes a few basic types of instruction (literal pushes, calls to user-defined words, calls to native core words, conditional jumps, etc) in sequence. The JVM is similar. The Parrot virtual machine, the Tao Virtual Processor, and GCC's RTL all consider the processor to be a bunch of registers (either a fixed number or an arbitrary pool), along with instructions that take their arguments from nominated registers and put the results into nominated registers.
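
For concreteness, here's a toy version of the register-style model - numbered registers plus three-address instructions - as a minimal Python sketch of my own, not Parrot's or RTL's actual encoding:

    def run(program, nregs=8):
        # Numbered registers; each instruction names its destination and
        # source registers explicitly.
        regs = [0] * nregs
        for op, dst, a, b in program:
            if op == "li":
                regs[dst] = a                    # load immediate
            elif op == "add":
                regs[dst] = regs[a] + regs[b]
            elif op == "mul":
                regs[dst] = regs[a] * regs[b]
        return regs

    # r2 = (3 + 4) * 2
    print(run([("li", 0, 3, None), ("li", 1, 4, None),
               ("add", 2, 0, 1), ("li", 3, 2, None),
               ("mul", 2, 2, 3)]))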

Either way, the implementation in terms of a specific CPU faces two main challenges.

  1. It has to map the intermediate language's storage model (stack or registers) to the actual registers of the real CPU. There are three basic approaches to this:
    1. For a stack-based intermediate language, you can use a static register allocation: set aside one or more registers as a cache for the top of the stack, and keep the rest of the stack in RAM. The other registers can hold stack pointers and other important global values, workspace for intermediate results, etc. However, if there are a lot of physical registers available, this often doesn't make the best use of them; you get far more memory traffic to that stack than you really need.
    2. For a register-based language, you can again have a static allocation. If your CPU has more registers than your intermediate language, this is trivial. If it has fewer, then you put some into real registers and some into a thread-local region of overspill memory. If the intermediate language's CPU model doesn't specify that some registers may be faster than others, this is a bit lame; but it's possible to define your register model such that (for example) the higher the number of a register, the more likely it is to be in RAM rather than on-chip, so the less expectation the front end can have of that register being "fast" - making it the front end's responsibility to use the lowest-numbered registers it can.
    3. For both kinds of intermediate language, you can just analyse the code to extract a data/control-flow graph, and then use standard register-allocation algorithms to implement it. In that case, whether it's a stack-based or a register-based system becomes fairly immaterial, except that systems using a stack or an unbounded register file will more easily produce optimal code. An intermediate language that specifies a fixed number of general-purpose registers forces a front end compiling code with more intermediate values than that to generate extra code to spill some data into memory, making room in the registers. If that code is then implemented on a physical CPU with more registers, it takes extra analysis on the part of the final compiler to detect this and use actual physical registers instead. And on a CPU with fewer registers, there may be the embarrassing case of the explicit loads and saves being implemented as moves between the spill memory location chosen by the front-end compiler and the spill memory location chosen for the virtual register by the back-end compiler - again, unless the back-end compiler goes to extra lengths to identify and undo this.
  2. It has to map the intermediate language's set of basic operations (adds, subtracts, etc) to the actual instructions of the real CPU. At first sight this is trivial, since the basic core instruction sets don't vary much between CPUs, but each CPU tends to have special features that are not present in the intermediate language, so they never get used - a waste when you want the best performance. So you have to perform complex pattern-matching operations upon the input code. Say your intermediate language is a load/store architecture, where adding the contents of two memory cells involves three instructions: two loads of the memory cells, then an add. But if your target CPU is something like an x86, you can do that in two instructions: one load, then one add-memory-to-register. To get the best performance, you need to analyse the intermediate code to spot that a register is loaded from memory and then used as the input to an add, and merge those two instructions. And when you get to things like the x86 LEA instruction, which can do fairly complex combinations of multiplies and adds in one go, you need to be looking for some pretty complex patterns. This kind of analysis is most easily performed upon a representation of the code as a data/control-flow graph, although if you stick to a register or stack model, the stack model makes a large number of simpler optimisations very easy to implement (in general, you can implement them as regexp-like rewritings of the instruction stream: A @ B @ + -> A @ B @+, where @+ is an add-contents-of-memory-to-top-of-stack instruction - see the sketch after this list). However, more complex structures become harder to identify in all cases.
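
Here's a minimal Python sketch of that regexp-like peephole rewriting over a stack instruction stream, fusing a fetch followed by an add into the @+ instruction from the example above:

    def peephole(code):
        # Scan the instruction stream for the pattern "@ +" (fetch, then
        # add) and fuse it into a single "@+" (add-contents-of-memory-to-
        # top-of-stack), as in the A @ B @ + -> A @ B @+ rewrite above.
        out, i = [], 0
        while i < len(code):
            if code[i] == "@" and i + 1 < len(code) and code[i + 1] == "+":
                out.append("@+")
                i += 2                  # consumed both instructions
            else:
                out.append(code[i])
                i += 1
        return out

    print(peephole(["A", "@", "B", "@", "+"]))   # ['A', '@', 'B', '@+']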

So if you're going for the bestest backend implementation possible, you probably want to just turn the input code into a data/control-flow graph, and whether it arrives defined in terms of stacks or registers is pretty irrelevant. Models with a fixed number of virtual registers are just silly, and should be avoided.

In which case, you might as well go for a stack model, since it tends to produce a more compact intermediate form; you'll get a bit of data compression for free. The "downside" of a stack model - that you sometimes need operations that just move things around on the stack and do no computation - isn't an issue, since those operations dissolve away into nothing when turned into a data-flow graph.
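
To see the dissolving happen, here's a toy symbolic evaluator (my own sketch, not any particular system's): it executes stack code against a stack of expression nodes, so DUP and SWAP just shuffle references and never appear in the output graph.

    def to_dataflow(code):
        # Run stack code symbolically, keeping expression trees on the
        # stack. Pure stack-shuffling ops rearrange references and emit
        # nothing, so they leave no trace in the data-flow graph.
        stack = []
        for ins in code:
            if ins == "dup":
                stack.append(stack[-1])
            elif ins == "swap":
                stack[-1], stack[-2] = stack[-2], stack[-1]
            elif ins in ("+", "*"):
                b, a = stack.pop(), stack.pop()
                stack.append(("add" if ins == "+" else "mul", a, b))
            else:
                stack.append(ins)       # a named input value
        return stack

    # swap and dup vanish: only the adds and multiplies survive.
    print(to_dataflow(["x", "y", "swap", "dup", "*", "+"]))
    # -> [('add', 'y', ('mul', 'x', 'x'))]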

And the other advantage of a stack model is that you can really, really quickly produce a naive but still quite efficient implementation. FORTH is famed for being implementable on a new platform in very few man-days. This is handy for getting started: you can hand-code a stack VM implementation in assembly language on a new architecture, and thus get your system running happily. Then later you can write a more complicated optimising implementation, and get something in the region of twice the performance or more, depending on how many registers and how many powerful instructions the target CPU has to take advantage of. And if you want an intermediate step, you can always use the search-and-replace peephole stack-code optimiser I mentioned above.
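
For a sense of scale, the core of such a naive implementation really is tiny - here's a Python sketch of the dispatch loop (instruction names are my own; a hand-coded assembly version has the same shape, just with the stack top cached in registers):

    def execute(code):
        # A naive stack-VM core: a data stack, an instruction pointer,
        # and a dispatch loop.
        stack, ip = [], 0
        while ip < len(code):
            ins = code[ip]
            if ins == "lit":
                ip += 1
                stack.append(code[ip])   # push the literal that follows
            elif ins == "+":
                stack.append(stack.pop() + stack.pop())
            elif ins == "dup":
                stack.append(stack[-1])
            ip += 1
        return stack

    print(execute(["lit", 2, "dup", "+", "lit", 3, "+"]))   # [7]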

So, I think, overall, stack-based intermediate languages win - but not by a huge margin.

1 Comment

  • By alaric, Sat 12th Apr 2008 @ 2:15 am

    As a side note: the desire to easily construct data/control-flow graph representations of a stack language implies that one should totally disallow the already somewhat questionable practice of defining FORTH words that have variable stack pictures - e.g., that do not move the stack pointer(s) by a constant amount on each execution... Otherwise, as well as it becoming difficult or impossible to extract the data flow within a word definition, it becomes equally hard to compute the overall stack effect of a word during its definition and store that in the dictionary, to allow words that use this word to have THEIR data flow analysed.

    Standard FORTHs usually let you define such words, but most FORTH programmers don't, since it can confuse a human reader as easily as it can confuse the implementation. One notable exception is when reporting errors (pun intended): some words return a boolean at the top of the stack. If the boolean is false, the operation failed; if it's true, the operation succeeded, and the results are then underneath the boolean. This allows for a programming style such as:

    DB-LOOKUP IF
      ." Found: " .
    ELSE
      ." Nothing found"
    THEN
    

    The . in the success arm of the conditional takes the number found in the database (and placed under the boolean consumed by the IF) and prints it, while the failure arm just prints a static message and consumes nothing from the stack, since there's nothing there.

    I see this as an argument for implementing exception handling, and then writing:

    TRY
       DB-LOOKUP ." Found: " .
    CATCH
       ." Nothing found"
    ENDTRY
    

