Node Trees: A model for configuring and managing large distributed systems

(FX: Flashback wibbly wobbly transition...)

So, as a teenager, I started working on ARGON, a distributed operating system. At the time, the CPU cost of encrypting traffic could be a significant matter when communicating over untrusted networks, so I'd worked out a protocol whereby network communications between clusters could negotiate to find the lowest-cost encryption scheme that both parties considered acceptable for the sensitivity of the data being transmitted: more sensitive data would require more secure protocols, which presumably excluded cheaper ones.

But I wanted to do something similar for communications within a cluster; I started with the same idea - finding the cheapest algorithm considered secure enough for the communication at hand. This could be simplified, as all nodes within a cluster share the same configuration, so both will agree on the same list of encryption systems, with the same security and costliness scores; so no negotiation is required - the sender can work out what algorithm to use, and be confident that the recipient will come to the same conclusion.
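
In other words, something like this - a minimal sketch in Python, where the cipher names, security levels, and cost scores are all made up for illustration:

    # Hypothetical shared cluster configuration: every node has the same list,
    # so sender and recipient independently reach the same conclusion.
    CIPHERS = [
        # (name, security level, relative CPU cost)
        ("none",          0,  0),
        ("cheap-cipher",  2,  1),
        ("strong-cipher", 5, 10),
    ]

    def pick_cipher(sensitivity):
        """Cheapest cipher whose security level covers the data's sensitivity."""
        candidates = [c for c in CIPHERS if c[1] >= sensitivity]
        if not candidates:
            raise ValueError("no cipher is trusted with data this sensitive")
        return min(candidates, key=lambda c: c[2])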

However, it pained me that highly sensitive data would be encrypted with expensive algorithms, even between machines connected by a trusted network - maybe even right next to each other, connected by a dedicated cable. I wanted a way to be able to, through configuration, tell the cluster that certain links between nodes in the cluster are trusted up to a certain level of sensitivity. Connections at that sensitivity level or below can use those links without needing encryption; anything above would use a suitably trusted algorithm.

The Node Tree

The configuration mechanism that came to mind was to organise the nodes into a tree, based on their proximity in terms of network connectivity and their physical containment within trust boundaries - two things that usually complement each other. Groups of nodes connected to the same LAN might be grouped together into subtrees, then LANs joined by private links grouped into larger subtrees, then those subtrees that communicate over the public Internet grouped into subtrees by country, so you can decide how much you trust each country's privacy laws; finally, the root of the tree brings all the countries together, along with any nodes that are mobile devices used by travelling people and so might crop up in any country. Actual cluster nodes are the leaves of the tree; interior nodes are just groupings for management purposes.

And then each interior node could be labelled with a sensitivity level. A communication path between any two leaf nodes involves traversing one or more interior nodes, and the minimum sensitivity level of those interior nodes is the sensitivity level that path is trusted up to without needing encryption. So the interior node grouping a bunch of machines on a trusted LAN can be given a high sensitivity - while the root node, and any lower node corresponding to public Internet connections, can be given a sensitivity of zero, indicating that they are not trusted with anything (without encryption).
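
In rough Python terms, the rule might look like this - a sketch with a made-up tree representation, not ARGON's actual data structures:

    class TreeNode:
        """A node in the node tree; leaves are cluster nodes, interior nodes are groupings."""
        def __init__(self, name, sensitivity=0, parent=None):
            self.name = name
            self.sensitivity = sensitivity   # level trusted without encryption
            self.parent = parent

    def ancestors(node):
        while node is not None:
            yield node
            node = node.parent

    def path_trust(leaf_a, leaf_b):
        """Minimum sensitivity of the interior nodes on the path between two leaves."""
        a_anc = list(ancestors(leaf_a.parent))
        b_anc = list(ancestors(leaf_b.parent))
        common = next(n for n in a_anc if n in b_anc)   # lowest common ancestor
        path = a_anc[:a_anc.index(common)] + b_anc[:b_anc.index(common)] + [common]
        return min(n.sensitivity for n in path)

    # Data at or below path_trust(a, b) can travel between a and b unencrypted;
    # anything more sensitive picks a suitably trusted cipher instead.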

(Dear future me: It would help if I drew some diagrams illustrating this, but I don't have time right now or this blog post will never get published)

Managing Node Trust

But I realised this also interacted nicely with another idea I'd been having. As well as communication security for data in transit, I designed into ARGON a similar model for the protection of data at rest; storage devices attached to a node can be trusted to hold data up to a certain sensitivity level without encryption, and data above that level will be encrypted. Where those two systems intersect, nodes themselves have a sensitivity level they are trusted up to; any data above that level of sensitivity will never be sent to that node by another node. I had originally planned to just store that sensitivity level against each node in a big list - but a better approach would be to also attach separate "absolute maximum sensitivity" trust levels to nodes in the node tree. Unlike the communications sensitivity levels, these would be optional; if a node did not have a maximum sensitivity then it would inherit the sensitivity of its parent; if this goes all the way to the root without finding a maximum sensitivity, then the result is positive infinity. And, on top of that, there is a rule: the maximum sensitivity of a node is not allowed to exceed the maximum sensitivity of any parent node.

And the point of all this is that the maximum sensitivity a leaf node is entrusted with cannot be higher than any maximum sensitivity attached to any parent node in the tree. So we can put a maximum sensitivity on an interior node representing a country, and data more sensitive than that will never be entrusted to nodes in that country, even if the person configuring a particular node makes a mistake.
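
Resolving a node's effective maximum could be as simple as walking up the tree and clamping to every maximum found along the way, which enforces that rule as a side effect - a sketch, assuming node objects with optional max_sensitivity values and parent links:

    import math

    def effective_max_sensitivity(node):
        """Walk up the node tree, clamping to every ancestor's maximum sensitivity.
        A node with no maximum of its own simply inherits its parent's; if nothing
        is found all the way to the root, the result is positive infinity."""
        limit = math.inf
        while node is not None:
            if node.max_sensitivity is not None:
                limit = min(limit, node.max_sensitivity)
            node = node.parent
        return limit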

As an aside: I've toyed with various ways of representing "sensitivity" labels; simple linear hierarchies, or maybe sets of labels like "Personal information" or "Trade secrets". As long as there's a way to compare sensitivities - to say "this sensitivity of data is less than the sensitivity level this thing is trusted with" - the system works.
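
For example, with label sets the comparison is just a subset test (a linear scale works the same way with <=):

    def is_trusted_with(data_sensitivity: frozenset, trusted_up_to: frozenset) -> bool:
        """Data may be entrusted if the recipient is trusted with every label it carries."""
        return data_sensitivity <= trusted_up_to   # subset comparison

    print(is_trusted_with(frozenset({"personal"}),
                          frozenset({"personal", "trade secrets"})))   # True
    print(is_trusted_with(frozenset({"trade secrets"}),
                          frozenset({"personal"})))                    # False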

Broadcast Trees

In a distributed system, some process often needs to send a message to a large set of nodes, potentially every node - for instance, global configuration changes. If you do this by sending an individual message to every node, then that sending node sends potentially hundreds or thousands of effectively identical messages, differing only in their recipient address. This can be a waste of bandwidth for that node's connection to the greater network.

A better approach is to form a tree structure: send your message to a smaller set of nodes, each of which then forwards it on to another set of nodes, like a pyramid scheme. Perhaps you send the message to ten nodes, and they forward it on to ten nodes each, so now 110 nodes have received the message; if each of those forwards it on to ten more, you can quickly get a message to every node in an arbitrarily large cluster by distributing the work of sending. After all, part of the point of a distributed computing cluster is to share the work!
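
Ignoring the node tree for a moment, the core fan-out idea is just splitting the recipient list into shares - a sketch:

    def fan_out(recipients, fan=10):
        """Send to `fan` delegates, each responsible for forwarding to its own
        share of the remaining recipients (recursively, in a real system)."""
        delegates = recipients[:fan]
        rest = recipients[fan:]
        return {d: rest[i::fan] for i, d in enumerate(delegates)}

    # One sender, ten delegates, a hundred further recipients:
    shares = fan_out(["node%d" % i for i in range(110)])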

But you need to choose a structure for this broadcast tree. It's a bit silly to send a message to a node on the other side of the planet, which is then responsible for sending that message to a node that's actually sat right next to you on a fast LAN. This delays the receipt of that message to that nearby node compared to sending it directly, and makes more use of limited and potentially expensive long-distance links.

Needless to say, if we have a node tree following the approximate network layout of our nodes, we can use this to build a broadcast tree. For a start, you can reduce the task of sending a message to a list of nodes by identifying any groups of target nodes that sit under the same interior node in the node tree; then you can send a single copy to a random member of each such group and ask it to forward the message on to its peers under that interior node, as it is presumably "close" to them in terms of latency and cost. If that reduced list is still large, you can start merging groups that share a nearby parent in the node tree, and so build a broadcast tree from the bottom up, until your set of direct recipients is small compared to the size of the cluster or there aren't any suitable groupings left in the node tree. Then you just send the message to that reduced set, which will forward it on to their network-nearby neighbours - reducing latency and making the best use of long-distance bandwidth.
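
One way that grouping might be sketched, assuming node objects with parent and child links as in the earlier sketches (grouping by tree depth is a crude simplification, but it shows the shape of the idea):

    from collections import defaultdict

    def tree_depth(node):
        depth = 0
        while node.parent is not None:
            node = node.parent
            depth += 1
        return depth

    def group_by_ancestor(targets, depth):
        """Group target leaves by their ancestor at the given depth, so one
        representative per group can forward to its network-nearby peers."""
        groups = defaultdict(list)
        for leaf in targets:
            node = leaf
            while node.parent is not None and tree_depth(node) > depth:
                node = node.parent
            groups[node].append(leaf)
        return groups

    def broadcast_plan(targets, max_direct=10):
        """Coarsen the grouping up the tree until the number of direct sends is small."""
        depth = max(tree_depth(t) for t in targets)
        groups = group_by_ancestor(targets, depth)
        while len(groups) > max_direct and depth > 0:
            depth -= 1
            groups = group_by_ancestor(targets, depth)
        # Send one copy to a representative of each group; it forwards to the rest.
        return {members[0]: members[1:] for members in groups.values()}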

(Dear future me: This, too, would be clarified nicely with a diagram...)

Distributed Caching

Now, around 2008 or so, I came across the notion of distributed caching, in the form of memcached.

Let me explain a bit of background, to provide relevant context. Computers often have to compute things, and this is sometimes expensive. For instance, imagine your application needs to go through the records pertaining to a user to generate some kind of report for them, and this takes ten milliseconds. If that report is needed to render every page in your web application, then the naive approach of computing it afresh would spend ten milliseconds on every request, even though (unless the user's data has changed) it's fed the same inputs and produces the same output each time.

If you have spare storage, it makes sense to record the result of this computation in a bit of storage known as a "cache", perhaps identified by a hash of the inputs to it. So if the user's data hasn't changed on the next request, the hash of the user's data matches an existing record in the cache, and we can just use it rather than re-computing it. When the user's data changes, the hash will change, so the report won't already be in the cache and it will be computed afresh then stored.
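
That's the whole trick, really - a sketch, where generate_report stands in for our hypothetical ten-millisecond computation:

    import hashlib, json

    cache = {}   # in a real system this would be shared storage, not a local dict,
                 # and note that nothing is ever removed from it here

    def cached_report(user_record):
        """Key the cache by a hash of the inputs; only compute on a miss."""
        key = hashlib.sha256(json.dumps(user_record, sort_keys=True).encode()).hexdigest()
        if key not in cache:
            cache[key] = generate_report(user_record)   # the expensive ~10ms step
        return cache[key]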

This would mean that the cache grows forever, accumulating old data that will never be requested again. And the cost of storing a report that is used again in future, but not for months, is probably greater than the cost of regenerating it. So caches usually have some means of expunging data when a better use for the storage space comes up. A common approach is, when more space is needed (either by the cache itself, or by a more important user of storage than the cache), to delete the least recently used items in the cache - as they're the ones least likely to be needed again soon.
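
A least-recently-used cache is only a few lines in Python - a minimal sketch of the eviction policy, not of any particular cache implementation:

    from collections import OrderedDict

    class LRUCache:
        """Bounded cache that evicts the least recently used entry when full."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.items = OrderedDict()

        def get(self, key):
            if key not in self.items:
                return None
            self.items.move_to_end(key)         # mark as recently used
            return self.items[key]

        def put(self, key, value):
            self.items[key] = value
            self.items.move_to_end(key)
            if len(self.items) > self.capacity:
                self.items.popitem(last=False)  # drop the least recently used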

Using a cache makes sense if the cost of the work done to make the things in the cache is high compared to the cost of storing them and, more immediately, the cost of accessing the cache itself. Caching the results of our ten-millisecond process in a robotic tape library that takes minutes to store and fetch items would make no sense at all. Caches are more attractive in fast storage (like system RAM), while slower storage (like spinning magnetic hard disks) only makes economic sense for caching things that take longer to generate than the disk's access time. A ten-millisecond report would benefit greatly from being cached in RAM, but might not be helped at all by being cached on a disk under load.

Also, the effectiveness of a cache depends on its size compared to the demand for it. If our system has a lot of users using it at once, so thousands of different user reports are needed per second at peak times and a hundred thousand users use it most days, then the cache will help a great deal if it has space for hundreds of thousands of user reports at once so any user's report doesn't need regenerating more than once per day; but if it only has space for a few hundred, then reports will be being constantly regenerated because other reports end up pushing them out of the cache (according to the least-recently-used principle).

The idea behind memcached is to increase the amount of storage space available for a fast cache. As network bandwidth rose and latency fell with the advent of affordable gigabit-and-above local area networks, memcached allowed a cluster of computers to pool their spare memory to form a shared cache; because many cache operations then need to talk over the network to access the memory in other computers, this made it a bit slower than local memory - but not by much, if the network is fast and small. It hit an economic sweet spot, particularly compared to having a separate cache on every computer. If incoming requests from users are randomly distributed between servers in a cluster, then with separate caches, every node ends up needing to cache its own copy of a given user's report. But with memcached, only one copy needs to be stored somewhere on the cluster, shared by all; whatever node that user's request goes to, it can access the cached copy of the report, even if another node originally made it. Given the low cost of network communication, this makes it worthwhile to cache anything that takes a few milliseconds and up to generate - and it scales nicely if the cluster grows.
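
The way every node agrees on where a given item lives is by hashing the key; this isn't memcached's actual client logic (real clients use consistent hashing, so adding a node doesn't reshuffle everything), but it's the core idea:

    import hashlib

    def owner_node(key, nodes):
        """Deterministically pick which node in the pool stores a given key,
        so every node in the cluster looks in the same place."""
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return nodes[digest % len(nodes)]

    nodes = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]
    print(owner_node("report:user42", nodes))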

Memcached, using memory, was tuned for small fast networks where the time taken to round-trip a message to another node isn't too many orders of magnitude above the cost of accessing local memory - so the benefit of sharing the cache justifies that added cost. Running memcached over an inter-continental cluster would make less sense, unless the work you're caching takes hundreds of milliseconds to regenerate. But such bigger jobs tend to be physically larger to store, and with the network latency so high there's less pressure to keep it all in memory, so you might as well store it on disks instead as they're cheaper than memory.

Scaling in the other direction, for very small fast jobs, it might make sense to operate something like memcached between processes running on a single server, using inter-process shared memory; that would offer very high bandwidth and low latency, that then might make it worth caching things that take mere tens of microseconds to generate from scratch.

Yet, sadly, we end up needing very different technologies to manage shared caches at these different scales, despite them implementing a very similar underlying concept.

So this got me thinking: could something like the ARGON node tree help here, too? If we have a unified caching mechanism across the entire cluster, but categorise jobs into "cost classes" of some kind (perhaps grouping them into millisecond jobs, ten-millisecond jobs, hundred-millisecond jobs, seconds-or-more jobs), we could then add configuration to the node tree to reflect what things to cache at what level. So interior nodes grouping a bunch of machines on a fast LAN together could be certified for millisecond-scale jobs, as we're confident we can round-trip a cache request between any two machines in that group in much less than a millisecond; while the root node might be rated only for seconds-or-more. Using this information, the cache system can, on any given node, thereby identify a group of neighbour nodes reachable within a given "cost class"; to cache an item in the millisecond class, it would use a distributed cache consisting only of nodes joined by the fast LAN (distinct from any other subtree elsewhere also joined by a millisecond rating), while ten-millisecond cached jobs might be distributed across a much wider group of nodes, perhaps going up to a national or continent boundary. That way you get bigger, albeit slower, caches for bigger things while still having small fast caches for small fast things.
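
A sketch of how a node might pick its cache group for a given cost class, assuming the node tree's interior nodes carry an administrator-supplied annotation (min_worthwhile_class below) recording the cheapest class of job worth caching across that whole group:

    # Hypothetical cost classes, cheapest to regenerate first.
    CLASS_ORDER = {"1ms": 0, "10ms": 1, "100ms": 2, "1s+": 3}

    def leaves_under(node):
        """All cluster nodes (leaves) grouped under an interior node."""
        if not getattr(node, "children", None):
            return [node]
        return [leaf for child in node.children for leaf in leaves_under(child)]

    def cache_group(local_leaf, job_class):
        """The widest node-tree ancestor rated for this cost class; all the
        leaves under it form the distributed cache used for jobs of that class."""
        best = local_leaf
        node = local_leaf.parent
        while node is not None:
            if CLASS_ORDER[node.min_worthwhile_class] <= CLASS_ORDER[job_class]:
                best = node
            node = node.parent
        return leaves_under(best)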

This is really nice! The application developer only needs to annotate cacheable workloads by an approximate cost class, telling the caching system how much network latency/bandwidth it's worth spending on trying to avoid re-computing this job. The system administrator only needs to annotate the node tree with estimates of what size of job are worth caching how widely. The caching system can record actual metrics of how long it takes to regenerate jobs and how long it takes / how much bandwidth is consumed caching things, to feed back to both parties as to whether they need to refine their estimates. And the system will automatically do the best it can with the information it has available to it. Isn't that cool?

Modern relevance

We return from our flashback, to the present day...

And that's what brought this to the surface of my memory lately. I'd heard people complaining that some distributed systems were tailored to local clusters - low latency, lots of bandwidth - and then performed poorly when distributed across continents, while other algorithms worked well on a planetary scale but didn't gain performance when run on a small local cluster.

But my proposed distributed-caching scheme would scale neatly. For nodes connected by a fast network, it could offer caching for small items; for nodes connected by a slow network, it would only cache large items. And as many levels in between as are required.

I think the node-tree configuration system, although not a perfect reflection of the structure of networks and trust boundaries, is an excellent approximation to it, while being a simple enough representation that humans can edit it without easily making mistakes. Using it as a basis for deciding how nodes communicate with each other allows distributed algorithms to automatically adapt themselves to the network topology, making it easier for them to scale up and down.

So, there we have it - designers of distributed systems, please consider this approach if it's useful to you!

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales