A new computer architecture

I was in a hardware mood on a train a few months ago, so I typed up some notes about a possible alternative architecture for CPUs that might make good use of internal parallelism, use asynchronous control, and achieve high code density. The result would be an efficient CPU, but it does make interrupt handling a headache.

I then go overboard, designing a device interconnection framework for expansion! It was a boring train journey...

= UPDATE =

It occurs to me that a neat way of easing the problems of interrupt handling in highly asynchronous and parallel CPUs would be, quite simply, to not bother with them: let a dedicated I/O processor, with an architecture designed for low-latency context switches, handle them instead. Pre-emption of user code is handled by allowing the I/O processor to instruct the main CPU to switch contexts (eg, an interrupt saying that a block is ready for reading from the disk controller makes the code that was blocked waiting for that data runnable, and that code then preempts the current process since it has a higher priority).

So the main CPU would still need context save and restore logic, but it wouldn't need to be able to nest IRQs or anything; all it would need is the ability to save the current context to RAM and then load another context from elsewhere in RAM, as an atomic operation, and it wouldn't need to be as fast about it as if it was in the critical path of interrupt handling.

Within my architecture below, this can be handled by having an (on-chip) interrupt processor which is a tiny stack-based MISC with local SRAM for code and data (shared with the main CPU). When an interrupt occurs, the MISC pushes its program counter (the only register!) onto the stack and jumps to a vector in the SRAM. If the interrupt handler decides it needs the main CPU to reschedule, it tells the CPU to switch to a context, giving it the address of the new context. The address the current context was loaded from is kept around in a register, so the CPU suspends instruction fetching, waits for all execution units to finish, saves the contents of registers and FIFOs to the current context in RAM, loads the new context pointer into the context pointer register, and then loads the new context and resumes execution. The CPU should check whether the new context being switched to is the same as the old context (by comparing the provided pointer with that in the context register), and if so, do nothing.

Most of the time, the interrupt handlers would just cause a switch to a special scheduler context, which would work by choosing a new process to run, manually invoking a context switch to it, then looping back to the top of its code. While the scheduler context is running, further requests from the MISC to reschedule will be ignored, since that context is already active; this means that an interrupt arriving between (or during) the scheduling algorithm and the context switch, and making a higher-priority process runnable, may not result in that process running until the next context switch. So perhaps the CPU should handle a request to switch to the context it's already in by just reloading the context, which would restart the scheduling algorithm.

Any context switch request coming in while a switch is in progress should be queued, perhaps by suspending the MISC until the CPU is ready to perform the switch.
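
To make the update concrete, here is a rough C model of the CPU-side switch logic described above; every name in it (context_t, ctx_ptr_reg, request_switch) is my own invention for illustration, and queueing of requests that arrive mid-switch is left out:

    #include <stdint.h>

    /* Purely illustrative model of the main CPU's context switch behaviour. */

    typedef struct {
        uint32_t regs[8];      /* general purpose registers    */
        uint32_t ip;           /* instruction pointer          */
        uint32_t flags;        /* flags register               */
        uint32_t fifo[8];      /* prefetched instruction words */
    } context_t;

    static context_t  live;         /* the state actually inside the CPU       */
    static context_t *ctx_ptr_reg;  /* RAM address the context was loaded from */

    static void drain_execution_units(void)
    {
        /* Suspend instruction fetch and wait for all units to finish. */
    }

    /* Called on behalf of the I/O MISC when it requests a reschedule. */
    void request_switch(context_t *new_ctx)
    {
        drain_execution_units();

        if (new_ctx == ctx_ptr_reg) {
            /* Already in this context (usually the scheduler): just reload
               it, which restarts the scheduling loop from the top.        */
            live = *ctx_ptr_reg;
            return;
        }

        *ctx_ptr_reg = live;         /* save registers and FIFOs to RAM     */
        ctx_ptr_reg  = new_ctx;      /* update the context pointer register */
        live         = *ctx_ptr_reg; /* load the new context and resume     */
    }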

= NEXT GENERATION CPU =

The CPU is highly internally asynchronous.

CONTROL UNIT

Instructions consist of three fields: the condition mask, a source port, and a destination port. If the condition mask matches the flags register, then the control unit puts the source and destination port numbers onto the internal bus, and asserts the READ control line. The module handling the source port should notice its port number on the bus and the READ line, and endeavour to provide data to the data portion of the internal bus. When the data is stable, it should assert the HANDOVER control line and hold it (and the data bus) until READ is dropped. The module handling the destination port should notice its number on the destination bus and the HANDOVER line, and proceed to accept the data on the bus, asserting the DONE line when it is complete. If the destination module is not ready, it can hold for as long as necessary.

The DONE line causes the control unit to drop the READ line, and to signal the instruction fetcher for the next instruction word. When the instruction is ready, the fetcher signals the control unit to examine the new instruction, and the cycle continues.
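
A minimal C sketch of one control unit cycle, treating the control lines as flags; the instruction field widths and the "all mask bits set" condition rule are assumptions of mine, not part of the design above:

    #include <stdint.h>
    #include <stdbool.h>

    /* Assumed instruction layout - condition mask, source port, destination
       port - with field widths that are my guess.                          */
    typedef struct {
        uint16_t cond_mask;
        uint8_t  src_port;
        uint8_t  dst_port;
    } insn_t;

    static uint32_t flags_reg;
    static bool READ, HANDOVER, DONE;   /* control lines, modelled as flags */

    /* Stand-ins for the modules attached to the internal bus. */
    static uint32_t source_module_read(uint8_t port)            { (void)port; return 0; }
    static void     dest_module_write(uint8_t port, uint32_t v) { (void)port; (void)v; }

    /* One control unit cycle. */
    void execute(insn_t i)
    {
        if ((flags_reg & i.cond_mask) != i.cond_mask)
            return;                        /* condition failed: skip          */

        READ = true;                       /* source sees its number and READ */
        uint32_t data = source_module_read(i.src_port);
        HANDOVER = true;                   /* source holds the data stable    */

        dest_module_write(i.dst_port, data);
        DONE = true;                       /* destination has finished        */

        READ = HANDOVER = DONE = false;    /* drop the lines, fetch the next  */
    }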

MEMORY CONTROLLER

A memory controller handles access to system RAM, connected to on-die cache SRAM, external cache SRAM, and external DRAM. The memory controller serves three masters:

1) It is accessible from software to load and save data. It provides a destination port and a source port that can be used to write and read an "address" register, plus a range of destination and source ports that can be used to read or write an 8-bit, 16-bit, or 32-bit value from or to memory, with or without sign extension, with or without auto-increment or auto-decrement of the address register, and with or without cache bypass - 6 bits of port address space in the source and destination ranges! (One possible encoding is sketched after this list.)

2) At a lower priority, it is used by the instruction fetcher to load instructions.

3) At the lowest priority, it is used by the state management unit to load and save state register files, to perform task switches.
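
For illustration only, here is one way those 6 bits of port address space might be laid out; the exact bit assignment is my guess:

    #include <stdint.h>

    /* One hypothetical packing of the memory controller's 6-bit port range:
         bits 5-4 : access width (00 = 8-bit, 01 = 16-bit, 10 = 32-bit)
         bit  3   : sign extension on reads
         bits 2-1 : address register update (00 = none, 01 = auto-increment,
                    10 = auto-decrement)
         bit  0   : cache bypass
       2 + 1 + 2 + 1 = 6 bits, matching the count above. */

    enum { WIDTH_8 = 0, WIDTH_16 = 1, WIDTH_32 = 2 };
    enum { ADDR_NONE = 0, ADDR_INC = 1, ADDR_DEC = 2 };

    static inline uint8_t mem_port(uint8_t width, uint8_t sign_extend,
                                   uint8_t addr_update, uint8_t bypass)
    {
        return (uint8_t)((width << 4) | (sign_extend << 3) |
                         (addr_update << 1) | bypass);
    }

    /* e.g. a sign-extended 16-bit read with auto-increment, via the cache:
       mem_port(WIDTH_16, 1, ADDR_INC, 0) == 0x1A                          */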

ALUS

The processor can have any number of logic units. Each logic unit will have destination ports to provide the input data, plus source ports from which the result can be read.

Simple ALUs may just consist of, say, an adder wired to a pair of latches fed by destination ports, with the adder output as a source port. By the time the system signals acceptance of the write to either latch, the output is stable anyway, and reads from the source port can be immediately acknowledged.

More complex ALUs will have one of the destination ports nominated as the last to be written to, since writing to it causes the logic unit to start processing. While processing, writes to the inputs or reads from the output are suspended until processing is complete and the result is latched on the output source port.

Even more complex ALUs may have internal pipelining, allowing multiple requests to be written to the input ports, then read from the output ports in turn. Compilers will need to be aware of the level of pipelining to avoid trying to stuff too many requests into the input, and ending up hanging.
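
To show why the compiler must know the pipelining depth, here is a toy C model of a two-input pipelined adder exposed through ports; the depth of four and the stall-as-boolean behaviour are assumptions:

    #include <stdint.h>
    #include <stdbool.h>

    #define PIPE_DEPTH 4   /* assumed pipeline depth; a compiler must know this */

    /* Toy model of a two-input pipelined ALU exposed as ports. */
    typedef struct {
        uint32_t a;                    /* first input latch             */
        uint32_t results[PIPE_DEPTH];  /* in-flight / completed results */
        int      head, count;
    } alu_t;

    /* Write to the first input port: just latches the value. */
    void alu_write_a(alu_t *alu, uint32_t v) { alu->a = v; }

    /* Write to the second input port: nominated as "last written", so it
       starts the operation.  Returns false if the pipeline is full - in
       real hardware the write would simply stall, and a compiler that
       issues too many writes before reading results would hang here.   */
    bool alu_write_b(alu_t *alu, uint32_t v)
    {
        if (alu->count == PIPE_DEPTH)
            return false;                              /* would stall */
        int slot = (alu->head + alu->count) % PIPE_DEPTH;
        alu->results[slot] = alu->a + v;               /* the operation itself */
        alu->count++;
        return true;
    }

    /* Read from the output source port: pops the oldest result, stalling
       (returning false) if nothing has completed yet.                   */
    bool alu_read_result(alu_t *alu, uint32_t *out)
    {
        if (alu->count == 0)
            return false;                              /* would stall */
        *out = alu->results[alu->head];
        alu->head = (alu->head + 1) % PIPE_DEPTH;
        alu->count--;
        return true;
    }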

COMPARISONS

A compare unit will provide two destination ports. When the second one is written to, the two inputs will be compared and the flags register latched with the result, before the write is acknowledged. The flags register can be used to drive conditional execution.

WORKING STORAGE

A few general purpose registers can be made available, with matching pairs of source ports (to read) and destination ports (to write) for each register. But mainly, a set of hardware stacks is used for working storage, to reduce the number of source/dest port numbers required all the time. Each stack provides a POP source port, a PUSH destination port, a DUP source port (that reads without popping), and a suitable selection of source ports that return the top stack element after a stack operation like SWAP or OVER.
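
A rough C model of one hardware stack and its ports; the depth and the particular set of operations are illustrative:

    #include <stdint.h>

    #define STACK_DEPTH 32            /* assumed hardware stack depth */

    typedef struct {
        uint32_t cells[STACK_DEPTH];
        int      top;                 /* index of the top element, -1 if empty */
    } hw_stack_t;

    static hw_stack_t stack0 = { .top = -1 };   /* one of the working stacks */

    /* PUSH destination port */
    void stack_push(hw_stack_t *s, uint32_t v) { s->cells[++s->top] = v; }

    /* POP source port */
    uint32_t stack_pop(hw_stack_t *s) { return s->cells[s->top--]; }

    /* DUP source port: reads the top element without popping it */
    uint32_t stack_dup(hw_stack_t *s) { return s->cells[s->top]; }

    /* SWAP-style source port: performs the stack operation, then returns
       the new top element, as a single combined access.                 */
    uint32_t stack_swap(hw_stack_t *s)
    {
        uint32_t t = s->cells[s->top];
        s->cells[s->top]     = s->cells[s->top - 1];
        s->cells[s->top - 1] = t;
        return s->cells[s->top];
    }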

INSTRUCTION FETCHER

The instruction fetcher waits until the memory controller is idle, then prefetches words into a FIFO. The IP register is accessible as a pair of destination ports (one for absolute jumps, one for relative jumps) and as a source port for inspection; altering it flushes the FIFO.

Also, the instruction fetcher provides a source port that will pull out the next instruction word as data, so it is not executed. This is used to insert literal constants in code. As a special concession, if the control unit notices that the source port field of a conditionally NOT executed instruction is this port, it will tell the fetcher to read and discard the next word anyway, to avoid executing the data as the next instruction.
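
A sketch of how the fetcher's literal port interacts with conditional execution; the 16-bit instruction layout and the port number are assumptions:

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define LITERAL_PORT 0x3F   /* assumed port number of the fetcher's
                                   "next word as data" source port        */

    /* Assumed 16-bit layout: [15:12] condition mask, [11:6] source port,
       [5:0] destination port. */
    static uint16_t src_of(uint16_t insn) { return (insn >> 6) & 0x3F; }
    static bool cond_met(uint16_t insn, uint16_t flags)
    {
        uint16_t mask = insn >> 12;
        return (flags & mask) == mask;
    }

    /* Walk a stream of instruction words, showing only the literal handling. */
    void run(const uint16_t *code, int len, uint16_t flags)
    {
        for (int pc = 0; pc < len; pc++) {
            uint16_t insn = code[pc];
            bool     is_literal = (src_of(insn) == LITERAL_PORT);

            if (cond_met(insn, flags)) {
                if (is_literal) {
                    uint16_t literal = code[++pc];   /* next word is data */
                    printf("literal 0x%04x -> port %u\n", literal, insn & 0x3F);
                }
                /* ... otherwise read the named source port as usual ...  */
            } else if (is_literal) {
                pc++;   /* condition failed: still consume and discard the data
                           word so it is never executed as an instruction      */
            }
        }
    }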

STATE MANAGEMENT

All working storage, ALU pipeline state and port data latches, the IP register in the instruction fetcher, the prefetched instruction FIFO, the memory controller address register and current operation state, the flags register, and the control unit instruction latch are managed by the State Management Unit. The state manager has a number of duplicates of the working register set; upon entering an interrupt state due to an interrupt line being asserted, the state manager stores the number of the current IRQ level on an internal stack, switches to the register set for the new IRQ level, then sets the IP to the correct vector for that IRQ level. Writing to a special destination port signals the end of the interrupt handler, causing the previous IRQ level to be restored and the appropriate register file to be selected.

Having the prefetch FIFO stored when an interrupt is handled allows the system to resume execution with minimal disruption, for reduced overall interrupt latency.

One destination port per IRQ level is provided to set that level's vector, apart from level 0, which represents normal execution of user code and starts at the reset vector, a hardcoded IP.

However, to enable a context switch, the state manager has two special destination ports. Writing an address to one of them causes the manager to load or save (respectively) all of the state information for level 0 from or to a contiguous range of external memory addresses, starting at the address written. Note that this can only be done from within an interrupt handler.
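
Purely as a guess at scale, the level-0 state streamed to or from that block of memory might look something like this; the exact contents, sizes, and ordering are not specified above:

    #include <stdint.h>

    /* Hypothetical layout of the level-0 state the state manager streams to
       or from a contiguous block of external memory on a context switch.  */
    typedef struct {
        uint32_t gp_regs[8];          /* general purpose registers            */
        uint32_t stacks[4][32];       /* hardware working stacks              */
        uint32_t alu_latches[8];      /* ALU port data latches/pipeline state */
        uint32_t ip;                  /* instruction pointer                  */
        uint16_t prefetch_fifo[8];    /* prefetched instruction words         */
        uint32_t mem_addr_reg;        /* memory controller address register   */
        uint32_t mem_op_state;        /* in-progress memory operation state   */
        uint32_t flags;               /* flags register                       */
        uint16_t insn_latch;          /* control unit instruction latch       */
    } level0_state_t;

    /* Writing a block's base address to the "save" destination port would
       store a level0_state_t there; writing it to the "load" port would
       read one back.  (Only legal from inside an interrupt handler.)      */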

SYSTEM MANAGEMENT

A system management unit provides a real time clock, a programmable one-shot countdown timer that triggers a high priority interrupt for time slicing, on-die and external temperature sensors, and a control interface for introducing a variable delay into the control unit's asynchronous execution cycle, to trade off power consumption against speed.

A special source port provides the constant value 0, to help with clearing registers, rather than having to read lots of literal 0s from the instruction stream. Other constants (1, -1) might be made available if there are a few spare source port numbers.

A special destination port, if written to, will suspend the control unit cycle until an interrupt occurs.

I/O

One or more I/O units are provided, which control external InfiniBand-style high speed serial links.

The I/O unit provides input and output FIFOs; to send a packet, write words to the output FIFO's destination port, then write an address to the address destination port, and the I/O unit will send the data and raise an IRQ when it's done.

When a packet has been received, the I/O unit raises another IRQ, and the data and source address can be read out of the receive FIFO via a source port.

The send and receive FIFOs are managed by the state manager, with a copy at each IRQ level, but the actual link transfers are done to and from hidden FIFOs. When a transmit is requested, the software FIFO is copied to the hidden FIFO; when a packet is received, the hidden receive FIFO is copied to the software FIFO. The hidden FIFOs are not managed by the state manager, so link transfers are not corrupted by IRQs, while software handling of the transmit and receive buffers is not corrupted by IRQs either.
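
A sketch of the software-side transmit and receive sequences, using invented port numbers and a write_port/read_port stand-in for port accesses:

    #include <stdint.h>

    /* Invented stand-ins for source/destination port accesses. */
    static void     write_port(int port, uint32_t v) { (void)port; (void)v; }
    static uint32_t read_port(int port)              { (void)port; return 0; }

    /* Invented port numbers for one I/O unit. */
    enum { TX_FIFO_PORT, TX_ADDR_PORT, RX_FIFO_PORT };

    /* Send a packet: fill the output FIFO, then write the destination
       address, which starts the transfer; a "transmit done" IRQ follows. */
    void send_packet(uint32_t addr, const uint32_t *words, int n)
    {
        for (int i = 0; i < n; i++)
            write_port(TX_FIFO_PORT, words[i]);
        write_port(TX_ADDR_PORT, addr);     /* kicks off the link transfer */
    }

    /* In the "packet received" IRQ handler: drain the receive FIFO, which
       was copied from the hidden hardware FIFO when the packet arrived.  */
    void receive_packet(uint32_t *words, int n)
    {
        for (int i = 0; i < n; i++)
            words[i] = read_port(RX_FIFO_PORT);
    }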

Serial links may go directly to devices (making the device address field of a packet redundant) or to a switching fabric.

The address field of a packet consists of a device address (16-bit) and a channel number (8-bit). Channel number 0 of a device is for the device control protocol. The other channels are device-specific.
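
The address field as described, written out as a struct (the rest of the packet framing is not specified here):

    #include <stdint.h>

    /* Address field of a packet: a 16-bit device address plus an 8-bit
       channel number.  Channel 0 carries the device control protocol;
       other channels are device-specific.                              */
    typedef struct {
        uint16_t device;    /* device address assigned by the switch fabric */
        uint8_t  channel;   /* 0 = control protocol, 1+ = device-specific   */
    } packet_address_t;

    #define CONTROL_CHANNEL 0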

An endpoint device should ignore the device address field of incoming traffic, and issue replies to the source address field supplied. It should not set the source address field of packets it originates.

Therefore, two endpoints connected directly will be fine.

However, a switch device will assign each attached endpoint device, and itself, a number, and will advertise to attached devices that it provides a range of devices, along with their device information. If it finds itself connected to another switch, it will allocate a range of its own number space to cover the range used by the other switch.

Packets from endpoint devices that are routed through the switch will have their destination fields filled in with the device number assigned by the switch. If a packet comes from another switch, the receiving switch will map that switch's assigned device number to the device number given by this switch.

If a device is hot-plugged into, or hot-unplugged from, a switch, the switch will inform connected devices that have registered an interest in enumeration. Other attached switches will assign an ID for a new device, noting that packets for that device should be forwarded to the switch on port N with the destination address changed to M.

So each switch manages its own address space itself, and translates to and from the address spaces of other switches.
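
A toy model of that per-switch translation: each locally assigned device number maps to an outgoing port and, when the next hop is another switch, to that switch's own number for the device. The table layout is my invention, and only the low 8 bits of the 16-bit device space are modelled:

    #include <stdint.h>
    #include <stdbool.h>

    #define MAX_LOCAL_DEVICES 256     /* only the low 8 bits modelled here */

    /* One entry per device number this switch has handed out. */
    typedef struct {
        bool     present;        /* still enumerated (not hot-unplugged)? */
        uint8_t  out_port;       /* physical port to forward on           */
        bool     via_switch;     /* is the next hop another switch?       */
        uint16_t remote_device;  /* that switch's number for the device   */
    } route_entry_t;

    static route_entry_t routes[MAX_LOCAL_DEVICES];

    /* Forward a packet: translate the destination into the next hop's
       address space and return the port to send it out of, or -1 if the
       device has gone away.                                             */
    int route(uint16_t *dest_device)
    {
        route_entry_t *e = &routes[*dest_device & 0xFF];
        if (!e->present)
            return -1;
        if (e->via_switch)
            *dest_device = e->remote_device;
        return e->out_port;
    }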

At the electrical level, various I/O link types are defined, from 1MHz low-speed versions with cheap connectors and embedded power for mice and keyboards, up to 10GHz optical fibre links. Switches may have a selection of ports of different types.

The channel 0 device discovery protocol should tie nicely into the HYDROGEN device properties/capabilities model. Each device can provide many capabilities, each with a base channel number.

Channel 0 should also provide power management interfaces.

Keyboards will provide two capabilities: text input (Unicode based) and "Buttons", where the latter deals with function keys (arrows, F1, HELP, ESCAPE, etc).

A mouse or joystick would provide "Buttons" as well as "Axes", where the latter provides relative and unrestricted (mouse) or absolute and limited (joystick) motion in 1 or more named axes. A scroll wheel is a third relative axis called 'zoom'. Complex axis devices may have a combination of relative and absolute axes.

A strong emphasis is placed on device intelligence. A device is expected to provide a standard channel API for each capability, with its local CPU running the 'driver' for the specific hardware. A network interface device would be expected to provide a TCP implementation at quite a high level.

The same switching and device identification model is applied from lowly mice up to mighty high speed network, video, and disk controllers, even though the underlying physical link might be quite different.

PHYSICAL ARCHITECTURE

The chassis contains a PSU, bays for internal and external disks, and a passive power backplane.

A CPU card goes into the backplane, and ribbon cables are used to link it to other cards in the backplane, and to devices in the bays.

The CPU card contains the CPU die, the external cache SRAM, the DIMM sockets or hardwired DRAM chips, CPU power regulators, and line drivers for the I/O channels hooked up to the I/O logic on the CPU die.

Backplane cards have a panel exposed to the back or front - there are slots down both sides.

PATHS FOR IMPROVEMENT

The external DRAM might, with a suitable cache coherency framework, be shared by two or more CPUs. Ideally, the CPUs would have atomic compare-and-exchange doubleword instructions. Rather than sharing a bus, the CPUs should all have their external memory lines run point-to-point to a dedicated memory sharing controller, which provides special support for concurrency control primitives.
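
To illustrate why an atomic compare-and-exchange matters once memory is shared, here is the classic spinlock pattern it enables, using C11 atomics as a stand-in for whatever primitive the memory sharing controller would actually expose:

    #include <stdatomic.h>

    /* A minimal spinlock built on compare-and-exchange - the sort of
       primitive the memory sharing controller would have to make atomic
       across all attached CPUs.  owner == 0 means the lock is free.     */
    typedef struct { atomic_uint owner; } spinlock_t;

    void lock(spinlock_t *l, unsigned cpu_id)   /* cpu_id must be non-zero */
    {
        unsigned expected = 0;
        /* Atomically: if owner == 0, set owner = cpu_id; otherwise retry. */
        while (!atomic_compare_exchange_weak(&l->owner, &expected, cpu_id))
            expected = 0;          /* CAS overwrote 'expected'; reset it   */
    }

    void unlock(spinlock_t *l)
    {
        atomic_store(&l->owner, 0);
    }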

Also, each DRAM chip might have a dedicated subcontroller chip between it and the CPU(s). The subcontroller contains arbitration logic and a small MISC stack processor with its own SRAM for working storage and for program code, dedicated to bulk operations in that piece of DRAM. The memory controller would need to provide a command channel to control the subcontrollers. Memory accesses from the main CPU would be flagged as being high or low priority - preempting the MISC's access or not. The MISCs could be programmed for cryptography, image processing, string searches, and other such bulk operations.
