## Replicated data (by alaric)

The project I'm currently working on is to build a replicated database; and, of course, this gets me thinking about TUNGSTEN, my own replicated database design.

Now, a fully-replicated database like the one I'm writing works by multicasting write operations to every node, so that each node has a copy of the entire database. This means that you can increase your read capacity just by adding more nodes and spreading the work between them. Each node can read its local database independently of the others, so there's no inherent limit to how many reads the cluster as a whole can do in one second - as long as you have unbounded money and space to keep adding more nodes.

However, since writes get multicast to every node, there comes a point at which the write load reaches the most the weakest node in the system can keep up with. Adding more nodes doesn't help, since every write still happens on every node. The cluster as a whole has a maximum write capacity, regardless of the number of nodes.

So this system will do for now, but the next big bit of work on it will be to increase the write capacity - by partitioning the data set.

Partitioning means that we no longer store every record of the database on every node, but instead have some rule that assigns each record to a subset of the nodes that are responsible for it. Computer scientists will know what I mean if I say "hash the primary key of every record and take the remainder after division by the number of subsets to get a subset number", but for the rest of you, imagine we split the cluster into two subsets - called A and B - and assign A responsibility for every even-numbered record (assuming records are numbered) and B is responsible for every odd-numbered record.
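That hashing rule can be sketched in a few lines of Python. This is a minimal illustration, assuming string primary keys; the function name and the choice of SHA-256 are mine, not part of any particular design - any stable hash works, as long as every node computes the same subset for the same key:

```python
import hashlib

NUM_SUBSETS = 2  # e.g. the two subsets "A" and "B" above


def subset_for(primary_key: str) -> int:
    """Hash the primary key, then take the remainder after
    division by the number of subsets to get a subset number."""
    digest = hashlib.sha256(primary_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NUM_SUBSETS
```

With two subsets this plays the same role as the even/odd rule: every record lands deterministically in exactly one subset.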

We then multicast writes to all even-numbered records to every node in A, and writes to all odd-numbered records to every node in B. So all the nodes in A will have copies of all the even-numbered records, and so on.

In other words, rather than storing a copy of every record on every node, we store exactly half of the entire database on every node. Which means that if the system as a whole is doing 100 writes per second, each node will get writes to half of the database - if the writes are to random records, that's then 50 writes per second. So the system as a whole can do twice as many writes per second.

When a node needs to read some data, however, it can no longer just query its local copy of everything - if it is in the subset that's responsible for the record it wants, it can, but otherwise it needs to talk to the other subset over the network to ask for a copy of the record. So reads get a little bit slower. But that's not a great problem if you use something like memcached across the entire cluster.
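The read path above can be sketched like this, using numbered records as in the even/odd example; `fetch_remote` here is a hypothetical stand-in for the network call to a node in the owning subset, not a real API:

```python
def read(record_id, my_subset, num_subsets, local_store, fetch_remote):
    # Which subset is responsible for this record?
    # (The even/odd rule, generalised to any number of subsets.)
    owner = record_id % num_subsets
    if owner == my_subset:
        # We replicate this record locally - fast local read.
        return local_store[record_id]
    # Otherwise ask the owning subset over the network - slower.
    return fetch_remote(owner, record_id)
```

A node in subset A (subset 0 of 2) answers reads for even records from its own disk, and forwards reads for odd records to subset B.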

But what's the correct number of subsets to have? Two will double your write speed, and slightly reduce your read speed on half of all reads. Ten will multiply your write speed by ten, and slightly reduce your read speed on all but one tenth of reads. At the opposite extreme, we could have NO copying of data - if every record has only ONE node that's responsible for it, all writes to that record will go to that node, and all reads, since that node has the only copy.

This produces a very scalable cluster, since you can just add more nodes to increase the read capacity (the records are spread over more nodes, so the number of reads per second to each node decreases with every node added, meaning the system as a whole can perform more reads per second before nodes start to get overloaded). The write capacity increases in the same way. And another property we've not discussed - total storage capacity - rises too: by adding more nodes, we have more disk space to store more records on, while the totally-replicated system can never store more records than will fit onto the smallest node in the cluster, as every node has to have a copy of every record.

However, there is another important property of the cluster: fault-tolerance. If one of our nodes dies, then we lose the records on that node, since no other node has copies of them. This is bad news, since the more nodes you have, the more things there are to break. If a computer has a 0.1% chance of dying in any given day (which is realistic) then a room of a thousand computers will suffer around one casualty every day. Even a cluster of a hundred computers will lose several each month.
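The arithmetic behind those numbers: if failures are independent, the expected number of nodes lost over a period is just the node count times the daily failure probability times the number of days (a sketch only - real failures are rarely independent, and correlated failures like a rack losing power are worse):

```python
def expected_failures(num_nodes: int, p_daily: float = 0.001, days: int = 1) -> float:
    # Expected node losses over the period, assuming independent
    # failures at the quoted 0.1% chance per node per day.
    return num_nodes * p_daily * days
```

A thousand nodes gives about one loss per day; a hundred nodes gives about three per month.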

But the original fully-replicated cluster doesn't bat an eyelid when it loses a node. In fact, it can lose several nodes, and no data will be lost unless ALL nodes go down, since every node carries a copy of the data (and when a new - or newly repaired - node is added to the cluster, it automatically sucks up the latest data from an existing live node, to get back up to date).

Also, if the cluster becomes partitioned - some of the intervening network dies, so some of the nodes are isolated - then it would be nice if the partitioned nodes could still at least run read-only, able to access local copies of data. But if there isn't a copy of a wanted record anywhere in the partition, then it can't even run read-only.
