Miscarriage (from the father’s eyes) (by )

My family is the single most important thing in my life. I grew up lonely - it was just my mother and I, and I always found portrayals of "typical family life" in popular media slightly painful to watch; I wanted that bustling house, full of children, with grandparents and aunts and uncles visiting. Sure, I'm mad about creating things; I love tinkering with computers and electronics and metalwork and DIY, and designing things around a table with my friends, but my biggest and best creation is my family.

So, I was delighted when, a couple of weeks ago, Sarah decided to do a pregnancy test; and it came out positive. We'd put if off for a while; we had some false starts when conceiving Mary, so we didn't dare get our hopes up too soon. We waited until it looked like the periods were definitely staying away, when it started to feel like if Sarah was pregnant we'd best be getting set up with a midwife and all that. I'd already been rather hopefully resting my hand on Sarah's belly on the sofa; but once the test came back positive, it was time to start snuggling down and talking to her baby. Partly soppiness on my part, and partly because I'm told that even after the first few weeks, babies start to learn their parents' voices (and it's never too early to start learning Lojban).

However, we only had a few days of that before things started to look a little wrong. I kept talking, telling the little grain of rice that there was lots to live for; it would have two siblings, three cats, two chickens, a mummy, a daddy, and a lovely house to live in, and I had so many wonderful things to show it. But the poor thing was probably already dead by then; a few days passed before everything got a bit medical, and I was carrying a bowl full of chunks of womb lining while a nurse wheeled Sarah through a hospital in a wheelchair, wondering if I was (in some grisly sense) at least getting to carry my child in my arms, for a while.

As is usual when things go wrong, Sarah was too ill to do anything, so I went into Caring Husband mode; looking after Jean and Mary, organising meals, cancelling things, supporting Sarah. I'm not shy about my emotions, as many male people are; but while there were things that needed doing and nobody else to do them, I didn't have the mental energy to feel them, so I just got on with it all. But once Sarah was home again and I didn't need to worry about her health all the time and things had settled down a bit and I had time to think about it (thanks to my lovely colleagues at both jobs, who covered for me), I finally had my chance to cry; Sarah was working on her picture, and wanted me to choose some colours for the rainbows in the lettering. I put together a spectrum (my choices are at the top of the first L, by the way), and I remembered being a little child, choosing that my favourite colour was blue; I went for sky blue and sunrise yellow at the time (although I've since moved towards darker, purplier, blues), and I wondered what this child's favourite colour would have been, and then all the pain came up and I cried on Sarah's shoulder. I still feel a lot of pain, but a good long sob helped me to heal a lot.

Now, I just want to find ways to record the existence of the little thing. When we had the initial scan, I wrote down the dimensions they told me on the appointment letter, because I feared that and the memory of seeing it, fleetingly, on a screen might be all we got to keep. When the ashes have been scattered, I'll try to find the memorial garden in Cheltenham, and go and visit.

I wrote a poem, in Lojban of course. I could translate it to English, but I'd have to either drop all the rich attitudinal indicators (which would make it rather boring) or try and explain them in English (which would make it long and convoluted), so I'll just leave it as-is.

.i .u'anaisai mi ba'ozi te tarbi
.i .uinai mi na ba pamjai le .iu tarbi
.i .uinai mi na ba tavla le .iu tarbi
.i .uinai mi na ba bevri le .iu tarbi
.i .a'onai mi na ba zgana lo nu le .iu tarbi cu ke banro
   joi cisma
   joi klaku
   joi cadzu
   joi tavla
   joi prami
   joi jmive

.i le .iu tarbi na ba djuno fi mi

.i co'o tarbi

Processor architecture (by )

The current state of the art in processor design seems to be a reasonably complex instruction set, which is interpreted by a thing that translates it into a series of more primitive instructions which are then fed into some kind of multiple-issue pipelined thingy with speculative execution. You know, the kind of stuff x86 has been since the 386. 64-bit instructions, vector SIMD instructions, lots of cores and all that are just variations on the theme.

I'm sure this is a local maximum in the space of processor designs. So few of the transistors on each chip seem to be actual ALU doing something useful. All this translation and pipeline control seems to be a lot of logic that's just adapting to the impedance mismatch between the ALUs and instruction set...

So, I'm always interested in more exotic processor architectures, and there's two different threads I'd love to explore (as in, design and simulate in an FPGA) if I had time. The common theme is simple control logic; this means you can fit in more ALUs, or wide ALUs and registers, in the same space - or just fit more cores and more cache on the same die.

Zero-operand stack machines

The idea here is to use a stack instead of a register file. This means that instructions just need an operator (eg, "add") as the operands are implicit - the stack always provides the inputs and outputs. This means that the instructions can be very small due to the lack of operands; generally, much smaller than a machine word, so each word loaded can have several instructions in. This can mean that the memory bandwidth required to feed the chip with instructions is reduced; and since the decode and control logic becomes very simple, you can sustain a high clock rate with minimal pipelining, so reducing the memory bandwidth consumed by instruction loads is handy.

That means you can't fit literals or static addresses inside instructions, though, so you need something like a "load immediate" instruction that fetches the next word from the instruction stream and pushes it, rather than treating it as instructions. If an instruction word contains several "load immediate" instructions, then that many subsequent words of instruction stream could be literals!

One example of this approach is a Minimal Instruction Set Computer, but the concept is broader than that. Large instruction sets can be easily supported.

The control logic boils down to loading an instruction word, then treating it as a FIFO of smaller instructions to execute while the next instruction word is loading. Most instructions just engage an ALU circuit hardwired to the top element or two of the stack, whose output becomes the new top of stack. A few might transfer data to/from a memory access unit or a register, including the instruction pointer to change the flow of control. Not many gates are needed to decode an instruction, leading to the short cycle time.

Instructions that can't complete in a single cycle present a problem, though. The use of a stack tends to mean that an instruction depends on the result of the previous instruction, so it's tricky to execute several instructions in parallel and thus make progress in the presence of weighty multiply/divide instructions or memory reads.

I can think of three ways of overcoming that, and you can combine all three:

Multiple stacks

The approach taken by the 4stack processor is to have four stacks, each with its own independent ALU. Each instruction word has an instruction for each ALU in, and they execute in parallel on each clock tick. Presumably, there's some means to transfer results between the stacks - I imagine a bus joining them, an instruction to pop a value from the stack onto the bus, and an instruction to push from the bus. The timings of the bus reads and writes are such that it's possible to have an instruction word with a pop->bus from one stack and push->bus for one or more stacks that do such a transfer in a single cycle.

Due to the synchrony of instructions feeding into each ALU, we can't "stall" an ALU. If one of them executes a weighty instruction or has a cache miss on a memory read, we either stall ALL the ALUs at once, or we mandate that certain instructions are followed by a fixed number of NOPs before another instruction can execute, to allow time for it to complete.

This puts the onus on the compiler to schedule instruction-level parallelism, and means that the compiler needs to know the precise timings (and number of ALUs) of the target CPU - we can't use the same instruction set for a broad range of implementations!

Result registers

Weighty instructions might not put their results straight on the stack; instead, the instruction might cause the inputs to be pulled from the stack and the instruction starts executing. When it completes, the result is latched into a result register, and a later instruction pushes the contents of the result register (stalling if it's not ready yet). This means that the instruction stream can get on with other stuff while the lengthy instructions run. However, it requires such multi-cycle instructions to inherently work differently; and it puts some onus on the compiler to know how many instructions to wait between starting these instructions and trying to access their results for best performance.

Virtual stack

Finally, we can virtualise the values on the stack. A division instruction, for example, might read two actual values from the stack and then push a token that means "Wait for the result coming from division unit 7". If the next instruction is an addition, then it would read that token and (say) a literal value from the next stack position; since one of the inputs is a token it can't execute yet, but it still assigns an addition ALU and loads the literal value. But it tells division unit 7 to, when it completes, push the result into port 1 of addition ALU 3; and it pushes a token that means "Wait for the result coming from addition ALU 3", and so on. Basically, rather than waiting for operations to complete so you can push a value to the stack, you can instead push a reference to an operation in progress; a cluster of ALUs and memory access units connected by suitable buses then becomes a kind of dataflow machine which is fed connections from the instruction stream, in effect taking the condensed zero-operand instruction stream and using it to assign dependencies between instructions, rather than using virtual registers to assign dependencies as in current CPU designs. But this requires the kind of complex control logic that I feel current CPU designs are drowning in.

Transport-triggered architecture

Another way to simplify control logic is to build your CPU as a bunch of modules with input and output ports. Arithmetic and logic operation modules have one or two inputs and a single output; a memory reader has an address input and a data output; a memory writer has address and data inputs and no outputs; registers have an input and an output; and so on.

Each instruction contains a few bits to control whether the instruction executes conditionally on bits from a flag register, then an output port to read from, and an input port to write the result to. The decoding consists of checking the conditional execution flags then either doing nothing, or pushing the input and output port IDs onto two address busses and toggling a strobe line that causes the output port to write its contents to a data bus, and the input port to load from it.

As with the zero-operand stack machines, the instructions are small, so can probably cram several into a machine word - maybe split into groups that share a single set of conditional execution bits, for even more compactness. These instructions are all operand and no operator!

To insert literal values in the instruction stream, one can again have an output port on the instruction fetch module, that when read pulls a literal value from the instruction stream and stops it from being interpreted as instructions.

The output of each module is a register, where a value appears as soon as it's ready and waits until it's read - so there's no need to explicitly store it in a general purpose register. However, the CPU might have a few general-purpose registers anyway to store stuff in, as well as the usual instruction pointer, flags, and machine control registers.

This makes it easy to exploit parallelism; the instruction stream can trigger lots of modules and then come back later to read their output registers. The compiler might need to know the cycles required to do various things and not read the outputs until they're ready, or there might be handshaking on the internal bus so that instructions stall until an output is ready, which makes it easier to deal with things like memory reads that can take widely varying numbers of cycles to complete. Even then, the compiler can still benefit from knowing cycle timings in order to schedule stuff better.

Modules could be pipelined. Rather than having four multipliers, you might have one that you can feed (say) four sets of inputs into and then, later, read the output register four times to get the results. The compiler might need to know how deep the pipeline is to avoid overflowing it with results; or the hardware spec might mandate that up to sixteen multiplies can be pipelined, and put a FIFO on the output register to make up the extra capacity needed beyond the number of pipeline stages it has.

The downside is that the compiler needs to know how many modules there are and what port numbers are wired up to what. This, again, makes it hard to have a single executable that can run on a wide range of implementations of the design.

However, this looks rather like the execution model behind the virtual-stack machine discussed above - so perhaps we could have a generic stack-based instruction set that is executed by a virtual stack to generate instructions for an underlying transport-triggered machine...

Modules could be quite complex; for instance, an index register module might comprise a register coupled directly to a memory access system. By accessing different input or output registers, it could update the contents of the register, or write to the memory address stored in the register, or read from the memory address in the register; and different input/output ports could be accessed that cause it to pre- or post-increment or -decrement the index register at the same time, allowing for efficient operations on contiguous blocks of memory. Also, the internal data bus might be arbitrarily wide, allowing ALUs to operate on, and registers to store, vectors of several machine words; modules that only operate on a single word at a time might sacrifice a few input-port-select bits in their instructions to select which word from the vector on the data bus to read into their input port.

To save on space taken up by literals, we can have a simple module with output ports that produce some useful constants (0, 1, -1); or dedicate a single bit of the instruction to selecting whether the input port number field specifies an input port, or is a literal to just load onto the data bus. An input port number will be much smaller than a machine word, so this will only cater for small literals, but most literals are small and we can fall back onto fetching an entire word from the instruction module for larger literals. We might want to sign-extend a small literal, however.

The data bus might become a bottleneck, but that's OK - we can have several of them, and make the instructions specify an input and output port number for each bus; we then trigger multiple transfers in each instruction cycle. This is very similar to having several instructions in a machine word, except that they execute in parallel rather than serial. We just now need to specify what happens if the same port is read or written by two parallel transfers!

Conclusions

A general theme with many of the above approaches is that the compiler ends up needing to know more about the details of the chip implementation, because the compiler is responsible for more scheduling.

Perhaps this is no bad thing - runtime code generation is becoming the norm anyway, and it would be possible to bootstrap the system by having an initial "minimal instruction set" which is standardised, and allows access to a description of the current chip architecture; the runtime code generator can then be compiled (using a compiler written in the minimal instruction set), and then the processor switched into normal mode. This might even be implemented by having a simple version of the stack architecture as a front-end processor that starts executing code while the main CPU is dormant; it then has an instruction that hands an initial instruction pointer value to the dormant main CPU and starts it up. Multicore systems would need only one front-end processor to bring the whole system up!

Another approach might be to have a transport-triggered architecture with a small set of guaranteed modules available at well-known port numbers in every implementation, with variation occurring in the rest of the port-number space. But this requires the instruction format to have enough bits for the port numbers to allow for the largest imaginable processor, leading to unnecessarily wide instructions for smaller devices. Perhaps this can be handled by having the instruction decoder support both standard narrow instructions and implementation-specific wider instructions, again starting off in standard mode and allowing switching to wide mode once the processor definition has been read and used to compile the compiler that can exploit the full capabilities.

Either way, I think that future processor architectures might be more tightly coupled to the compilers than we're used to.

Shame (by )

I'm at the pub for a meeting, but there's a minor commotion from next door; I hear a glass smash and some amused voices. A regular, a well-known local in his nineties, has had too much to drink. A party is organised to walk him to his nearby home; everyone responds with good-natured smiles. "Aw, bless him."

But I am transfixed with vicarious shame. I feel horribly embarrassed for him, and my stomach churns with stress about it. I find everyone else's reactions jarring; they seem mildly jealous of him if anything, while I find his situation absolutely humiliating. If something like that happened to me - no, let me be clear: if I did that to myself and people saw - I would not be able to look those people in the eye ever again. I don't know if I'd be able to leave my house.

I have a mental model of the world, which gives me expectations about what counts as "normal" behaviour for the people and other objects in the world. When I see things happen that are consistent with my model (objects fall to the floor when released, people are happy with they are given cake, that sort of thing) it is unremarkable; things that are inconsistent attract my attention, as they indicate either that I have incomplete knowledge of the situation or a problem in my mental model. As I've built this mental model over the decades of my existence, I've checked every new thing I incorporate into it for logical inconsistency with something else, so I'm reasonably confident that it's consistent and a correct approximation to some kind of objective reality.

The majority attitude towards inebriation contradicts my mental model, but I can't just incorporate it, because it's inconsistent with other things in my model.

For instance, people are critical of flaws in others. As a child, if I made a mistake, I'd invoke the wrath of my mother. At school, if I made a slip and broke the myriad and shifting social contracts, I'd attract the attention of the bullies. In my career, if I make a mistake it will have consequences for my colleagues, the company I work for, and the users of the products I work on. If I make a mistake in my domestic duties at home, my children won't get to school / their clubs / parties they're invited to, or we won't have food for dinner; and they will be angry with me. If I make a mistake while driving, I will injure or kill myself or others. I often hear people complaining about other people who have made mistakes, even if those mistakes had no actual negative consequences; they are criticised for making mistakes as a matter of principal.

Mistakes are very easy to make; a moment's inattention can result in something important being forgotten. Slip-ups attract ridicule and disapproval.

But the way people react when somebody has deliberately made themselves into an idiot through inebriation starkly contradicts that general trend. Why is there an exemption made for this case?

I had a dream, when I was aged somewhere between eight and ten or so years old. In this dream, I'm on a huge futuristic spacecraft, of a similar scale to a cruise ship, full of passengers, watched over by a team of sinister police robots. I'm in a fancily-decorated room with little tables dotted around, with passengers sitting at them and chatting. In this room, little drinks are available, in tiny glasses the size of my little fingertip; barely a cubic centimetre each. There is something seedy about this; the drinks are handed out covertly, with much glancing around, out of sight of the police robots. I decide to try one, and the effect is instant; my point of view moves backwards slightly, and I become a third-party observer of my own actions, a passive rider in my own body as I circulate in the room and chatter with people, with this big idiot grin on my face. But the idiot grin attracts the attention of the police robots; scowling and disapproving, the corner me and shoot me with a dart gun which dispenses an antidote, meaning I am instantly myself again. But I feign innocence; I claim I was grinning because I was happy, and that their accusation that I had consumed one of the tiny glasses was unfounded, and act all offended. Even remembering that dream now, thirty years later, causes my face to flush with shame. It's taken quite a lot of bravery to publish it here. I had to build up to it in stages. What I'm ashamed of is that I had a dream in which I was affected by some kind of drug, because it acknowledges the concept of me being so affected even exists.

But far worse than my shame-by-proxy is the sense of alienation, because I'm having this strong emotional reaction that's completely absent in the people around me. It's like everyone around me is laughing and cracking jokes while eating babies. I feel like there's something terribly wrong with everyone around me (which is scary), while logic tells me that the problem is clearly with me. Which is even scarier.

I try and avoid situations where I might be reminded of this. Pubs are risky places to go, but only mildly so; there isn't a strong culture of inebriation in most of them, so I just avoid places like student bars. House parties are far riskier, and I dread being invited to them; accepting the invite may lead to pain, but refusing it means sitting at home on my own knowing what's happening anyway (well, not really knowing; my imagination instead provides a stream of worst-case scenaries), and being on my own while everyone else is having fun (in a way I find inexplicable and distressing) hardly makes a sense of alienation any better. It's worse when the party is at my house (I never hold parties, but people I live with do), because it's harder to hide from a party in my house, people will ask awkward questions if I leave, and I have this feeling like my "safe place" is being invaded; I make my way through life by, where possible, shutting all this stuff away, and it being in my own home makes that harder.

But avoiding situations where people might drink alcohol isn't enough, anyway, even if I could do it perfectly. People still talk about it around me, and thus, I am forced to confront the concept. I can think of no way to avoid it without isolating myself from all people and all mass media.

To be honest, I feel pretty angry about it all. Why do I have to hide, and be an outsider, flinching away from this concept? People around me can, just through saying a few words, hurt me. When a group of friends or colleagues organise a group social activity, I have to choose whether I'll suffer for going or suffer for not going. What's more, I've been told that if I don't go to an event I've been invited to, I'll offend the person who invited me. Apparently this is more important than the pain I'll suffer.

My attempts to tell other people how I feel have often ended badly. Responses tend to be either:

"You're weird, that freaks me out, go away"

"Whoa! That's weird. So does related concept X upset you? How about Y? Really? Hahahah! X! Y! That actually makes you feel ill from me just saying those words? X! Y! Z! This is fun!"

"How dare you criticize my actions! It's my choice what I do with my body, and your choice whether you put up with it or go elsewhere."

Most people just seem confused by it, and then seemingly forget I ever mentioned it. A few people have actually tried to avoid saying things that will upset my in my presence, which is heartwarming, but the concepts are deeply embedded in our culture and are impossible to avoid: attempting to avoid them just leaves awkward gaps, and I know what would fill those gaps. The best that can be done, I suppose, is to say what needs to be said, but without the assumption that everyone feels as the majority does, so I don't feel neglected. But that's not an easy thing to ask.

I don't want to be like this. I can't change the world, so I need to change myself, but how do I do that? It's hard to think about the underlying sense of shame, because the feeling of alienation is too painful; and it's hard to think about the feeling of alienation because the anger clouds it, and anger is such a destructive all-consuming emotion. Indeed, it's taken me years of careful reflection to even isolate the other feelings. My emotional response to exposure to inebriation was basically "Confused burst of painful negative emotions then ANGER". Pulling apart that little burst of emotions before the anger wins out has taken a lot of careful detective work, feeling a bit like a physicist deducing the presence of the Higgs Boson by looking at the trajectories of particles streaming out of a hadron collision. But now I'm aware of the shame at the root of it all, I can feel it. I just can't stop the anger coming in and clouding it.

So perhaps I can address the problem indirectly. What else gives me similar feelings of irrational shame, but without the complexities of alienation and anger on top?

One answer comes to mind: Dancing. I'm usually one of those people who professes he can't dance, and only tries to when under duress; at which point I just find an action and repeat it until whoever's forcing me to dance lets me stop. I have no enthusiasm for it, and struggle to understand why people do it.

But occasionlly, if I'm in a really good mood, listening to dancy music that I have happy associations with, I feel a faint glimmer of a strange pulse-quickening excitement that it might be nice to dance to it. The thing is, if I hold that thought, a flash of embarrassment comes and destroys it, so I need to keep it just out of mental reach. Perhaps if I could overcome that, lesser, shame, I would weaken the greater one. The problem is, I don't know how to. It's not like I'm standing there thinking "I want to dance, but don't know how to"; I mean, I want to dance in the sense that I usually feel very lonely and left out and forgotten when everyone is dancing apart from me, but my problem is that I want to want to dance, and I don't know how to want to.

What else is there that's similar? Oddly, there's something I have the anger about without the shame or alienation: and that's coffee. Around the time when Starbucks was really invading the UK I had a girlfriend who thought Starbucks was great, so I was always being dragged into them. The thing is, I don't like hot drinks at all, and I find the taste of coffee absolutely disgusting. At most, Starbucks could offer me over-priced orange juice, and I got sick of that pretty quickly. This touched a bit of a raw nerve: coffee wasn't being presented as something some people like, as an option; the ubiquitous Starbucks (and their competitors), the attitude of people towards them ("Let's meet in Starbucks", "Fill in this quiz and be rewarded with a Starbucks voucher"), and the decor and advertising all seemed to draw on an assumption that everyone liked the foul stuff, while I didn't.

And, of course, I have a massive chip on my shoulder about that from the alcohol thing. So being offered coffee, or having to go to coffee places and get coffee, gives me this little jolt of irritation. I used to just bite my tongue and repress this, but over the past few years I've decided it's probably healthier for me to let a little bit of snark loose. As a Repressed Minority Coffee Disliker, I probably shouldn't feel I have to put up with everyone assuming I don't exist; so I'm trying to actually say that I think coffee tastes awful and that I hate coffee shops when it comes up. It's cathartic, but there's a lot of pent-up bile left; this will take a while to finish... And I don't think that fixing that will do all that much to fix my anger about alcohol.

There's one more thing that I think might be related. I really like funny things; as a kid, I really liked surreal comedy, and could easily end up laughing so hard I could barely breathe. These days, I've lost that; I feel too much shame about the thought of somebody seeing me laughing like that. I've come close, but then I feel a sudden chilling fear that I'm going to irritate people or that they'll think I'm an idiot. But the difference here is that I can remember not having that fear.

If you'll permit me an aside, I've been teaching my daughter how to ride a bike. I did this by holding her upright and pushing her forward as she starts to pedal, so that she can get going on the bike and learn to balance, because she was struggling with getting started at all. I decided to hold her up and skip the getting started part, because learning to keep balancing while the bike's moving and you're already pedalling is a lot easier - but once you've got the feel for that, you know what riding a bike is supposed to feel like. So then you can learn to start, because you know what state you're aiming for when you push off. Trying to ride a bike from nothing is a lot harder, because you push on the pedal and wobble and fall over, because even if you push off right you don't know how to ride a bike so you'll fall over - so you can't tell if you're learning to push off right or not.

So while I can't imagine dancing in front of people (as opposed to going through the motions of pretending to dance, which is different), or even doing that unspeakable thing with alcohol that could never be in any way associated with me, I can actually imagine myself having a really good laugh about something hilarious. And, just like how knowing what riding a bike should feel like helped my daughter to quickly learn how to start from standing, I think that means I have a chance of being able to overcome the shame and recover that ability.

And maybe learning to deal with that shame will be transferrable, and I'll be better able to deal with other kinds of shame.

So, who is willing to help me overcome a crippling phobea that's causing me untold misery, by coming around to my place and watching Monty Python DVDs? Soft drinks only, I'm afraid.

Cool things I have worked on: Clustered analytic database (by )

In my last blog post, I talked about how I was involved in building a database optimised for low latency on single-record reads.

So it was a bit of a change to later work on a database optimised for high throughput on bulk querying! Rather than getting single records, it was all about finding all the records that match some criterion and doing something with them as fast as possible, where "fast" is measured in "records per second". When I started there, the minimum time to perform a trivial query (eg, SELECT 1) was about three seconds, let alone any that fetched any data. However, on the right hardware, it could process untold numbers of gigabytes of data per second once it had gotten going. That three-second minimal round trip time was negligible when dealing with queries that processed hundreds of terabytes of data in ten minutes. That said, one of my first projects was to fix some low-hanging fruit, which brought the minimal round trip time down to around 300 milliseconds.

Unlike the previous database, this one had an SQL parser; rather than an API to get and set records, you could enter SQL queries in all their glory, and get back result sets.

You didn't generally INSERT, UPDATE or DELETE individual records (although that was possible, it just didn't perform well, as will be explained later). Generally, data came into the system through CSV or BCP files that were plopped into the filesystem. The system would sort them on a chosen key (why? You'll see), then split into partitions of something like a million records. These partitions were all compressed into "tree files". And because each partition was compressed independently, these jobs could be spread over the cluster.

The compression algorithm was interesting. If you put data in a CSV and gzip it, you'll knock it down to a quarter or so of its size, depending on the data. Our algorithm usually got a factor of ten better compression. This was a big deal for our customers as they had to store many petabytes of data, and dividing the size by a factor of forty meant spending a fortieth of the money on immense disk arrays. Also, most of our query performance was limited by how fast you could read data from disk what with modern CPUs being so good; so compressing things by a factor of forty meant forty times the bulk throughput.

It worked by looking at each column and working out a set of unique values present in it, and sorting them. These columns could then be compressed with fairly conventional techniques for storing sorted lists of things (such as storing each element as the difference from the previous element, and then using conventional algorithms such as deflate on the result).

So if we had the following data:

MakeModelColourQuantity
FordMondeoWhite7
FordMondeoBlue6
FordTransitWhite7
FordTransitRed7
SkodaFabiaWhite4

We would store it as a bunch of columns:

  • Make: Ford,Skoda
  • Model: Fabia,Mondeo,Transit
  • Colour: Blue,Red,White
  • Quantity: 4,6,7

But you still need to know how to reconstruct the records from those values. Each record could be thought of as a vector of integer indices, each of which was an index into the sorted values for that column. If you just stored those records-of-indices, you'd be able to reconstruct the records and would probably have gained some good compression.

If we start the indices at 0 for the first value in each column, our original data might now look like this, referring back to the sorted columns we stored:

MakeModelColourQuantity
0122
0101
0222
0212
1020

But we did more than that. We didn't actually write those records-of-indices to disk. The compression engine took pairs of columns in the records-of-indices representation, and again deduplicated and sorted each of them.

Which pairs you took mattered greatly to how good a compression you got, in subtle ways, but we tried to get related columns. Make and Model are connected, so we'd do well to pick that one first. Each model only ever comes from a single manufacturer in our example, so the combined (make,model) columns only have three values - (0,1), (0,2) and (1,0). So we could store those in a list we'd call an "index set", which actually went onto disk, deduplicated and sorted and compressed like a normal column:

  • Make+Model: (0,1), (0,2), (1,0)

Now the table (in memory) looks like this:

Make+ModelColourQuantity
022
001
122
112
320

Again, we pick a pair of columns. There's no obvious choice any more, so I don't know what our default heuristic would actually do, but let's say we decided to join Make+Model with quantity. The pairings we find are (0,2), (0,1), (1,2) which happens twice, and (3,0), so we'd store:

  • (Make+Model)+Quantity: (0,1), (0,2), (1,2), (3,0)

(not forgetting that the sorted list of values is difference-encoded and compressed, rather than written literally like that).

Our table in memory now has only two columns:

(Make+Model)+QuantityColour
12
00
22
21
32

At this point, we can't continue any further. Each entry in that table corresponds to a record in the original data, and by looking up the index sets we can expand out the columns to recreate the original table in its entirety; so we'd sort that final two-column table and store it (difference-encoded and compressed, of course). Our final entry in the file, the "root indexset", would be:

  • root: (0,0), (1,2), (2,1), (2,2), (3,2)

But we don't need to decompress the entire tree. Unlike a "gzipped CSV", if we just want some of the columns of the table, we can not bother traversing the index set tree to extract every column. If we just wanted the colour for every car in the table, we'd read the root indexset, and the colour valueset so we could look up the colour values, and we'd be done. If we wanted just the makes, we'd need to read the root indexset, the make+model+quantity indexset, the make+model indexset, and the make valueset.

However, if we wanted MIN(Make), MAX(Make) or did a SELECT DISTINCT Make, then we don't even need the root set - we can just open the Make valueset and take the first or last value, or all values.

Anyway, tree files, once built, were given unique partition IDs and put into some kind of storage. We didn't really care what the storage was, and supported several different backends - anything that let us save a file with a name, and get the file back with the same name, would do nicely. We supported raw filesystems (over NFS or on a clustered filesystem so all the servers in the cluster could see it), HDFS, and various enterprise storage devices.

So how did we know what tree files made up each table? Well, a data structure called the registry mapped table names to table objects. We supported a range of different types of table objects, but the most common one was called a "Lazy Union", which represented a UNION ALL in the SQL sense - a bunch of tables concatenated together. It was lazy because we used it to union together all the tree files that made up a table, but the system avoided actually trying to load all the tree files and UNION ALL them together (as that would generally be many petabytes of data); in fact, we basically never actually read any tree files in a lazy union - we just queried it for metadata. The lazy union object itself stored a list of tree files and metadata about them, which was kept in a file called a "partition cache".

So what happened if the user did an INSERT? Well, we had a feature to enable this; the extra records would go into (effectively) a CSV file associated with the table, and a reference to that CSV file put into the partition cache for that table. When the CSV got big enough, we could built it into a tree file and start a new CSV file.

As for UPDATE and DELETE: Well, one caveat of our representation was that we never modified tree files. So to UPDATE or DELETE something, we'd create a "delta file" that referenced the tree file we wanted to modify, then contained instructions as to how to modify it (deleting or updating records). We'd replace the reference to the tree in the partition cache with a reference to the delta file; and voila, the system would see changed or removed records.

We used the immutability of tree files to support point-in-time querying. If you set a query timestamp on your database session, the system would effectively roll back the log of changes to the partition cache to restore the table to the state it was in at that point in time, undoing bulk imports, INSERTs, UPDATEs, and DELETEs.

This trick also worked for ALTER TABLE. If you modified the schema of a table, every existing reference to a tree (or a delta file or a CSV or anything else) in the partition cache would be replaced by a reference to the original file plus a schema-change file that explained the change. So the old data would be translated as it was read, with columns being removed, re-typed, added (with a default value), etc. And, of course, if you put a query timestamp on to query the tables as they were in the past, those scheme changes would appear to be undone; deleted columns would return.

When a query came in that looked like SELECT ...some expressions... FROM table WHERE ...some conditions..., we would look the table up in the registry, and get a lazy union object connected to the partition cache file representing the current state of the table. The query planner would "push down" the condition clauses into the lazy union (LU) object, and it would use the metadata in the partition cache to restrict the list of partitions that might match. The main kind of metadata we stored was the minimum and maximum values of every column in each tree file. If our WHERE clause was column BETWEEN a AND b, we could skip any tree whose value range for that column didn't intersect a..b; and many other kinds of WHERE-clause expression could be used to eliminate based on range.

This is why we sorted each batch of data as it came in. Most of the data we stored was logs of something, so each batch covered a fix time range anyway, and so would produce a bunch of trees restricted to a range of some datetime column; so we could query efficiently on date ranges. But if we then sorted by the next most common WHERE-clause constraint column, we would constrain the trees in the batch to each cover a narrow range of that column, and therefore query more efficiently for ranges of values for that column.

But what about restrictions on columns other than event date and the sort column? Other values in the table, unless correlated with one of those columns, tended to end up rather randomly scattered, so the min and max for that column in every partition tended to approximate the min and max for the entire table, and we couldn't eliminate trees in the lazy union using constraints on that column.

Normal databases let you create indexes on columns, but a B-tree index tended to take up space in the order of many bytes per record in the base table. Unless the table had a lot of columns, this means the index is something like a tenth the size of the uncompressed data. As our tree files were about a fortieth the size of the typical data we had, even a single B-tree index would dwarf the original table; and the compression was a big selling point for people. THat wouldn't do. We needed subtler tricks.

So we let users nominate columns they'd like to do equality queries on (field=literal). For each partition, we'd then create a bloom filter of all the values of that column occurring in the partition, and store it in the partition cache, associated with the tree containing that partition. Bloom filters are basically a set, but with a twist: you insert values into the set, then you can query the set to see if a value exists. While a normal set gives you a yes or no answer in this case, a bloom filter gives you a "maybe or no" answer; it's either definitely not in the set or it might be in the set. In exchange for that vagueness, the bloom filter is a lot smaller than a conventional set representation. How big it is is something you can choose - larger ones are less likely to say "maybe" when the answer is really "no" (false positives). In our cases, we went for a 5% false positive rate by default, which gave us bloom filters that consumed five bits per unique value in that column in the tree file; as each tree held about a million records, so would store anywhere between one and a million unique values for a given column, that meant the filters consumed anything from a byte to 600KiB. When a query came in of the form SELECT ... FROM table WHERE bloom_filtered_column=x, we'd scan through all the bloom filters in the partition cache to look for trees that might contain the value x, and then perform the query on those.

And we could get more advanced than that. If users wanted to optimise queries of the form column LIKE '%foo bar%' - looking for a substring in a field - they could create another kind of bloom filter, which only worked on text columns. In this case, we'd go through each value of that column in the tree and extract all the groups of N characters (N might be four or five, typically). So if the field contained I eat food, the groups of four characters would be "I ea", "eat", "eat", "at f", "t fo", "foo", and "food". We'd take all of those for the entire column in the tree, and make a Bloom filter of them. Then if the user asked for a substring of N characters or more - N had to be chosen to be smaller than or equal to shortest expected search substring - we would do the same for the search string and then look in the bloom filters for those. So if the user had column LIKE '%eat food%' in their WHERE clause, we'd split that into "eat", "at f", "t fo", "foo", and "food". Only trees whose bloom filter return a "maybe" for all of those substrings can possibly contain the target string, so we can eliminate any that don't. We had a few other variants on the bloom filter, too, for even more specialist cases.

So, let's recap. I've explained how we physically stored table data, how we imported batches of data, and started to explain how we do simple queries on a single table.

As I was saying, when such a query came in, we'd get this Lazy Union (LU) object that represented all the trees in the table, and we'd tell it to restrict itself to trees that might be pertinent to the query based on the WHERE clause. If this step didn't reduce the list of trees at all, that was fine; what mattered was that we eliminated as many trees as possible to improve performance by considering less data, while never eliminating a tree that might contain records of interest.

Once we had the restricted list of trees in the lazy union, we would be able to generate the parallel query plan. In this simple case, we'd generate a template job that could be applied to each tree file (in this case, this job would open the tree file it was being applied to, request only the columns it needed for the SELECT clause, and then filter them on the WHERE clause to find only the matching records). That template job would be applied to each tree found in the Lazy Union in parallel across the cluster, and the results streamed back to the user.

Join queries were a bit fiddlier. If the user wanted a simple two-table join, such as SELECT ... FROM a, b WHERE a.x = 123 AND b.y = 456 AND a.z = b.z, we would load up two LUs, one for a and one for b, and restrict both by the WHERE clauses that only affected on table: a.x = 123 on a and b.y = 456 on b. Those would limit the numbers of partitions involved, but still, any partition from a could contain rows that match rows from any partition of b - so we would need to consider all possible combinations. This meant running a number of jobs proportional to the product of the size of the tables. So we tried to reduce that in a few ways.

Firstly, once we'd built a restricted lazy union for one table (say, a), we could quickly work out an approximation to the MAX and MIN of the join column, z, in table a. Because we'd already restricted the LU on the a.x = 123 clause and just had partitions matching that, we could just look at the MAX and MIN of z in those partitions (data which we held in the metadata in the partition cache), and then when we built the restricted LU for b, we could restrict it on b.y = 456 AND b.z BETWEEN min AND max, thereby further restricting b. Of course, once we'd done that, we could approximate the min and max of z in b using the same technique, and further restrict the LU for a, and thus get a smaller z-range for a, and further restrict the LU for b - in fact, we could continue this until the range of z stopped shrinking if we wanted to, but in practice, it usually wasn't worth going for more than a couple of iterations.

Also, rather than building a job for every possible combination, we could instead look at each individual partition of table a, find the actual set of values of z in that partition, then restrict a copy of the LU (already restricted as already described) even further to find only matching trees from b to combine with that partition from a.

Some queries don't need to return all the matching records; some queries were just for aggregrates such as COUNT or SUM or MIN or MAX; some queries were GROUP BY queries that would aggregrate in groups; some queries still returned every record but needed to sort them first because of an ORDER BY. In these cases, we still generated per-tree jobs in parallel to find relevant records, but then we'd run a single special job on a single cluster node that would read the results of the parallel phase and do final processing.

There's a final simple case - a query so simple it just needs a single job. SELECT 1, for instance, which needs no table data from tree files. Or queries that can be fulfilled purely from the tree metadata in the partition cache, so again didn't need to read any tree files.

But what about more complex queries? Joins of three or more tables? Queries with subqueries? Common table expressions with a WITH..AS? Well, we handled all of them by splitting them up into a directed acyclic graph of simple "sub-queries", which operated on a combination of actual "base" tables and the results of other queries; a single final sub-query would produce the final results of the query. In fact, internally, we converted the entire query into a big WITH..AS with each query being one of the above cases we already know how to handle.

We would run the subqueries in parallel, as soon as all the tables they needed existed. To begin with, only the subqueries that depended purely on base tables would be able to run, but the results of subqueries would be built into temporary tables, in a special registry namespace local to that query. As subqueries finished, other subqueries would become runnable because their input tables would exist. We'd run subqueries in parallel across the cluster, which was pretty cool. Eventually, the final query would become runnable; we'd run it and send the results to the user and be done.

So a query that joined tables a,b,c and d could join a and b in parallel with joining c and d, then join the results of those two.

But with all this parallelism going on - different subqueries running different parallel jobs at the same time, all competing for cluster resources with each other and other queries in the system - analysing what happened during a query could have been difficult. Just writing logs on each cluster node wasn't really enough. So each query was given a cluster-wide unique ID when it started, which was passed down to every subprocess involved in executing it. Each process within a subquery had a role name unique within the subquery, assigned as part of the query plan; so a process could be identified by the query ID, the subquery name, and the process role name. So we logged events into a per-process log file on the node the process ran on, identified by the process' unique identifier. Once the query was done, we'd either make every node delete all log files for the query ID if nothing went wrong; but if the query failed, or if a trace was explicitly requested, we'd gather all the log files for the query together onto one node and consolidate them into a "trace file".

This file was an sqlite database, containing all the log events in a highly structured form that made it easy to pull out only events pertaining to particular parts of the query. We also stored high-resolution timestamps so we could extract precise timings for performance analysis. The log events fell into three groups: Traditional log strings as you'd find in any UNIXy log file, machine-readable data items with a key and a value, and phase start/stop events. The latter kind broke the query down into phases, ranging from high-level stuff such as entire subqueries and the planning versus execution phases of subqueries down to fine-grained parts of individual jobs: opening tree files, performing various algorithms, and so on. These served two purposes: they gave other log events a context as to what the system was doing when they happened, and they let us see where the time was spent in a query for performance reasons. The machine-readable data items were generated to record various interesting properties (one example was that every process logged its rusage counters at the end of execution - CPU seconds consumed, peak memory size, number of IO reads and writes, that sort of thing). We recorded a lot of interesting things, such as the peak parallelism of subquery execution.

Because of all the parallelism, phases overlapped with each other; so I took great pleasure in writing a tool that would take a query trace file and produce a Gantt-style chart showing how the time within it broke down. This thing was very popular; support staff could ask users to send them the trace from a failed query, or request a trace for a poorly-performing query, and then we could dissect the trace file to work out what happened. Also, they were popular with the test team; a query that was carefully written to test a particular algorithm in the system could check that the algorithm was actually used by looking in the query trace. As a future change to the query optimiser might alter how queries executed, it would otherwise have been quite common for a test that checked we got correct results in a situation where a certain algorithm was used to no longer actually trigger that algorithm, meaning a part of the system would be left untested. Logging the peak parallelism of subquery execution as a machine-readable data item meant that the various limitations on maximum parallelism could be tested; and tests that were meant to check that everything worked correctly at high levels of parallelism could check that the desired parallelism was actually achieved and some other bottleneck in the system didn't prevent it.

It was a very different world, working on that system compared to the low-latency database I covered in the previous blog post. Which, me being me, just makes me think about ways in which the two approaches could be merged into a database capable of supporting both kinds of workload...

Cool things I have worked on: Low-latency highly-available NoSQL data store (by )

I've worked on a bunch of cool things in the past; and since I'm always explaining them to people, I realised it'd be good if I just blogged about them so I can just share the link.

I was asked to help design and then build a database to back a social networking system. The main requirement was to be able to satisfy lots of single-record primary-key fetches from a single thread as quickly as possible, because a PHP web app would be requesting lots of data to build the page; we had to do a few hundred of these "random point queries" in a small fraction of a second to get the page rendered in time. It had to be able to scale sideways, by adding more machines to the cluster. It also needed to be able to update this data, but at a much lower load - updates and creations of single records happened when a user went and did them, but people browsing around the site would hit several pages, each of which would request hundreds of records.

And it needed to be highly available. The app was to be spread over two datacentres for redundancy purposes, and as long as at least one server was reachable, the whole thing had to keep working as much as possible. We couldn't use "quorum" systems where a majority of the nodes needed to be reachable to make an update, for instance.

This, combined with the projected dataset size being a few terabytes at most, meant we decided to go for full replication. Every node had a copy of the database, and rather than being a daemon you spoke to over a socket like a conventional database, we'd write a shared library that would be loaded into the PHP process, so as to avoid context switch overheads for each query. The shared library would read the data files directly from disk.

The on-disk format was B-trees, but with a multi version concurrency control setup: update transactions (more on those later) would write all the new data into unused space until it was ready, so reads could still be happening while an update was being prepared, so as not to delay those all-important reads.

Updates were done with a reliable multicast protocol. The shared library would broadcast up dates to all nodes currently reachable from the process sending the update. Those updates were received by a daemon process running on every node, which took updates from the network and applied them to the local disk storage. But there were many devils in the details here.

Because every daemon had to process the entire update load of the system in parallel, they had to try and maximise the use of disk bandwidth. We couldn't afford to be rushing around updating every record with its own transaction, with all the sync overheads; we had to batch them somehow to amortize the overheads of committing them to disk reliably. Also, as updates were handled asynchronously and sent from a large pool of PHP processes, we needed to stop the PHP application from sending updates faster than we could process them and overrunning the network buffers. And just to make matters worse, updates could come from the network in unusual orders for various reasons, but we needed to apply them in a consistent order on every node so multiple updates to the same record all produced the same result in the end.

So, every daemon had an update queue. Its highest priority was pulling updates off the network and into the queue; its second priority was processing updates from the queue. But the queue wasn't just a FIFO. We gave every update a sequence number when it was generated, from a sequence number counter in shared memory on every node; and to ensure global consistency, we implemented Lamport timestamps - when an update was received from another node, we immediately updated the local sequence number counter to make sure it was more than the received sequence number, so that all updates issued from a node with knowledge of a given update would carry later timestamps, thereby ensuring that any transaction which read a current value and then updated that to a new value would be applied "later" than the update that provided the old current value.

So our queue ordered incoming updates by sequence number (using the ID of the issuing node as a tie breaker, to ensure a global ordering). We also stored the update sequence number on every record on disk, and refused to update a record to an "older" value, to deal with re-orderings over a longer time period than those which met in the queue. The queue also noticed updates to the same record group and merged them into a single update, always replacing older data with newer according to the sequence numbers. This meant that a flurry of updates to the same record would, as long as they all arrived before the queue was flushed, all "coalesce" into a single update to the on-disk data.

Wait, I said "record group"; what's that? Well, in the data model, often a bunch of records would represent parts of one object. Imagine an invoice in the traditional SQL format, with an invoice record and a bunch of line item records that are components of it. We encouraged the use of small records, as we updated records by providing a new value for the whole thing, and lots of small records meant that finer-grained bits of a record can be updated in parallel without trashing each other's updates. So as part of the "schema" we defined a mapping from record primary keys to record groups; many records would go in the same group, and was stored as a single entry in the on-disk B-tree, so we wanted to coalesce updates to the same record group into one, thereby getting more records updated in a single read-modify-write cycle.

Under the hood, the queue was two data structures - a priority heap full of record-group update lists, ordered by sequence number; and a hash table from record-group IDs to the same update lists. Incoming updates were mapped to record-group IDs and looked up in the hash; if an update to that group already existed then the update would be merged into it, and if the sequence number was older than the one already in that queue entry, we'd move the entry up the queue by re-inserting it into the priority heap. Otherwise, we'd create a new entry with a single update to the record group, and pop it in the hash and the heap. Keeping the two in synch in all cases was a bit of work, but once we'd covered all the cases, it looked after itself.

The queue also helped us prevent overload, too. Every node measured how big the backlog was, in three different ways: There was a maximum number of record group updates we wanted to hold in the queue, a maximum number of bytes of new record data in the queue, and a maximum average latency (in seconds, between the PHP code requesting the update and it getting written to disk on the node). We tracked the latter by putting wall-clock timestamps in the update messages and keeping an exponentially-weighted moving average of the difference (like a UNIX load average), and the former two came directly from the queue data structure. On each node, we compared each of the three measurables to our target and scaled it to a 0-100% capacity measure; we then took the worst of the three as the backlog score for that node, a measure of how "far behind" it was.

These were broadcast to the other nodes in the network, to compute an overall "worst backlog of any node" for the entire cluster. We were a bit clever with this - if a node's backlog wasn't the worst, then there was no point in it telling any other nodes about it, so only the nodes that were at the worse backlog level in the system were broadcasting their state. Again, there were lots of edge cases in this, which we carefully handled to make sure the system as a whole knew the worst backlog state of any node in the system, with the minimum of network noise.

So each node now knew how backlogged the cluster as a whole was; and we used that information to apply back pressure to the PHP clients. The backlog score was stored in the shared memory segment, and the function that was called from PHP to send an update consulted it. If the backlog was more than a minimum, then gentle pressure was applied to slow the PHP application down - before sending the message, we would sleep for a duration that depended upon the backlog level. If it was at or more than 100%, we'd actually sleep in a loop until it abated; but otherwise, we'd sleep for a time that increased smoothly from zero to infinity as the backlog went from 0% to 100% (accounting for the situation where the backlog dropped while we were sleeping, to a point where we'd already slept enough). This meant that if the PHP app was trying to generate writes faster than we could get them to disk, there'd be a certain amount of "stretchiness" in the system as the implicit queue in the network and our update queue took up the slack. But if the update queue kept growing, the PHP app would face increasing delays in getting updates applied. This created a negative feedback loop (as there was a finite limit to how many PHP processes could exist across the cluster), and we tuned the shape of the feedback curve such that the system would settle to a stable backlog level (and, thus, a stable sleep duration, and thus a stable write rate) rather than oscillating.

This meant that when the PHP code did an update, it would be some time before that update was visible on every node. For some parts of the system, that was fine, but for other parts, an immediate response to updates was required, so we gave the caller of the read function a choice of "consistency level". At the most basic and fastest level, it would just check the local state, and potentially miss out on updates that were still flying through the network. At the "immediate" level, however, it would first hash the record key and use that to select a "primary node" for that record. That node would be responsible for storing, in-memory, the latest state of that record if it was currently being updated; because the update function also offered a consistency level setting. The most basic one just send the update out onto the network, and the "immediate" level first sent the update to the primary node's in-memory update buffer. The use of specific nodes for this threatened our reliability guarantee; if that node was unavailable, or crashed and restarted while an update was in progress, we'd lose the update from the buffer - but it would still proceed through the normal mechanism. So if updates and reads to a record both requested "immediate" consistency, we would not see update latency (except when nodes failed), but at the cost of network round trips harming the latency.

Updates had a third consistency level, too, which was meant for creating new records with an external primary key (such as a user record, keyed by email address): if this was selected, then a two-phase commit process was used. The record creation would be proposed on every reachable node, which would cause that primary key value to be marked as reserved in the database, or the proposal would be rejected if it already existed or was already reserved. If a majority of the nodes in the system were reachable and OK with the update, then it would be actually performed; otherwise, it would be rejected. These were slow, and we tried to avoid them, but some parts of the application required that level of atomic global consensus.

You might be worrying that these asynchronous updates could get lost between a PHP process sending them to the network and at least one node receiving them, but all updates were received by the local node intrinsically; and once an update is on one node, it'll get to all the others. The reliable multicast protocol lets each node know if it's missed some updates from others (by using a per-node monotonic sequence number on every message, and looking for gaps). If that happens, it would try and retransmit if they were still in the send buffers; if not, then a network break would be declared, and the nodes would know they'd missed some updates from some other nodes. This would start a process by which all nodes that had missed some updates would declare the range of global sequence numbers they seemed to be missing, and the nodes that had updates in that range would then re-transmit them all (by using a special index from update sequence number to record across all tables). They would be retransmitted with the original sequence numbers, not new ones, so no matter what order which nodes got them in, they would eventually get to all nodes and end up in the same global ordering, so every node would end up with the same final state.

A node that starts up from a reboot or from fresh used the same process - the knew the last sequence numbers they'd been fully in synch with the cluster at, so they'd be able to request the latest state of all records updated since then. A totally new node would have seen no updates, so would request all records changed since timestamp zero, which would be all records. In these cases, the node would request that the playback be unicast to it alone, as opposed to the network partition case where nodes would cooperate to request broadcasts of all the missed updates to get the whole network back together.

And so we had a system that could (on my 2008-era laptop in a VM) do several thousand random record reads per second from a single thread, while handling bursty asynchronous updates of a thousand or so records per second (many many more if they all happened to fall into a smaller number of record groups!), while nodes came and went from the cluster, and the cluster was occasionally split into arbitrary sub-clusters due to network link failures. When the cluster was working properly, it would provide a completely consistent view of the database; but when the cluster was in a failure state, all nodes could service reads and updates (except the super-consistent unique primary key creation updates, which would all fail if a quorum of nodes couldn't be contacted to agree), but updates could then be delayed. I've not seen a system since that can provide that level of service.

WordPress Themes

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales
Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales