Category: ARGON

Node Trees: A model for configuring and managing large distributed systems (by )

(FX: Flashback wibbly wobbly transition...)

So, as a teenager, I started working on ARGON, a distributed operating system. At the time, the CPU cost of encrypting traffic could be a significant matter when communicating over untrusted networks, so I'd worked out a protocol whereby network communications between clusters could negotiate to find the lowest-cost encryption scheme that both parties considered acceptable for the sensitivity of the data being transmitted: more sensitive data would require more secure protocols, which presumably excluded cheaper ones.

But I wanted to do something similar for communications within a cluster; I started with the same idea - finding the cheapest algorithm considered secure enough for the communication at hand. This could be simplified, as all nodes within a cluster share the same configuration, so both will agree on the same list of encryption systems, with the same security and costliness scores; so no negotiation is required - the sender can work out what algorithm to use, and be confident that the recipient will come to the same conclusion.

However, it pained me that highly sensitive data would be encrypted with expensive algorithms, even between machines connected by a trusted network - maybe even right next to each other, connected by a dedicated cable. I wanted a way to be able to, through configuration, tell the cluster that certain links between nodes in the cluster are trusted up to a certain level of sensitivity. Connections at that sensitivity level or below can use those links without needing encryption; anything above would use a suitably trusted algorithm.
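As a rough sketch of the selection rule described so far: the cipher names, the numeric scores and the `choose_cipher` function below are all invented for illustration, and sensitivity and link trust are assumed to share one integer scale.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cipher:
    name: str
    security: int  # highest sensitivity level this cipher is considered safe for
    cost: int      # relative CPU cost; lower is cheaper


# Hypothetical cluster-wide configuration, identical on every node.
CIPHERS = [
    Cipher("light", security=3, cost=1),
    Cipher("medium", security=6, cost=4),
    Cipher("heavy", security=10, cost=9),
]


def choose_cipher(sensitivity, link_trust=0):
    """Return the cheapest acceptable cipher, or None if the link needs none."""
    # A link trusted up to this sensitivity level can carry the data unencrypted.
    if sensitivity <= link_trust:
        return None
    candidates = [c for c in CIPHERS if c.security >= sensitivity]
    if not candidates:
        raise ValueError(f"no cipher is trusted for sensitivity {sensitivity}")
    # Break cost ties by name so sender and recipient always agree.
    return min(candidates, key=lambda c: (c.cost, c.name))
```

Because every node evaluates the same function over the same configuration, the sender and the recipient reach the same answer without any negotiation.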

Read more »

Scoping generic functions (by )

So, my favourite model of object-oriented programming is "Generic Functions".

The idea is that, rather than the more widespread notion of "class-based object orientation" where methods are defined "inside" a class, the definition of types and the definition of methods on those types are kept separate. In practice, this means three different kinds of definitions:

  1. Defining types, which may well be class-like "record types with inheritance" and rules about what fields can be read/written in what scopes and all that, but could be any kind of type system as long as it defines some sort of "Is this an instance of this type?" relationship, possibly allowing subtyping (an object may be an instance of more than one type, but there's a "subtype" relationship between types that forms a lattice where any graph of types joined by subtype relationships has a single member that is not a subtype of any other member).
  2. Defining generic functions, by providing the structure of the argument list (but not the types of the arguments, although in systems with subtyping, there may be requirements made that some arguments' types are subtypes of some parent type) and the type of the return value and binding that to a name.
  3. Defining methods on a generic function, which are a mapping from a set of actual argument types to an implementation of the function, for a given generic function.

Note that the method refers to the type and the generic function, and is the only thing that "binds them together". Unlike in class-based OO, the definition of the type does not need to list all the operations available on that type. For instance, one module might define a "display something on the screen" generic function taking a thing and a display context as arguments; this module might be part of a user interface toolkit library. Another module might define a type for an address book entry, with a person or organisation's name and contact details. And then a third module might provide an implementation of the display-on-screen generic function for those address book entries. All three modules might well be written by different people, and only the third module needs to be aware that both the other modules exist; their authors might never hear of each other.
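The three-module arrangement can be sketched in Python using `functools.singledispatch` from the standard library. This is only an approximation: `singledispatch` dispatches on the first argument alone, the "modules" are squashed into one file, and the `display` function, the `AddressBookEntry` type and the list used as a display context are all made up for the example.

```python
from dataclasses import dataclass
from functools import singledispatch


# "Module" 1: a UI toolkit defines the generic function, knowing no concrete types.
@singledispatch
def display(thing, context):
    raise NotImplementedError(f"no display method for {type(thing).__name__}")


# "Module" 2: an address book library defines a type, knowing nothing about UIs.
@dataclass
class AddressBookEntry:
    name: str
    phone: str


# "Module" 3: the only code that needs to know about both of the others.
@display.register(AddressBookEntry)
def _display_entry(thing, context):
    # The display context is stood in for by a plain list of output lines.
    context.append(f"{thing.name}: {thing.phone}")
```

Calling `display(AddressBookEntry("Ada", "555-0100"), lines)` appends one formatted line to `lines`; neither the toolkit nor the address book library had to change to make that work.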

This is good for programmers, in my opinion, as it makes it easier to build systems out of separately-designed parts; it exhibits what is sometimes called "loose coupling". In a class-based system, the author of the address book type would either need to be aware of the user-interface toolkit and make sure their address book entry class also implemented the "display on a screen" interface and declare an implementation of the UI logic (which might not be their interest, especially if there's a large number of UI toolkits to choose from), or users of the address book class in combination with that UI toolkit would need to do the tiresome work of writing "wrapper classes" that contain an address book entry as an instance member, and then implement the display on a screen interface, and have to wrap/unwrap address book entries as they move in and out of user-interfacing parts of the application.

"Ah, but what if the user inherits from the address book entry class and implements the display-on-screen interface in their subclass?", you might say, but that's only a partial solution: sure, it gives you objects that are address book entries AND can be displayed on screen, but only if you explicitly construct an instance of that class rather than the generic address-book entry class - and third party code (such as parts of the address book library itself) wouldn't know to do that. Working around this with dependency injection frameworks is tedious, and success relies on every third-party component author bothering to use a DI framework instead of just instantiating classes in the way the language encourages them to do. An ugly solution, when generic functions solve the problem elegantly.

It also provides a natural model for multiple dispatch. Class-based "methods within classes" mean that every method is owned by one class, and methods are invoked on one object. In our address book UI example, the generic function to display things on screens accepts two arguments - the thing to display and a display context. In a class-based system, this means that the display method defined on our address book entry is passed a display context argument and can invoke operations on it defined by the display context class/interface/type, and if it wants different behaviour for displaying on a colour versus monochrome screen (remember them?) it needs to make that a runtime decision. However, in a generic function system, there would be separate subtypes of "display context" for "monochrome" and "colour", each defining different interfaces for controlling colours. This means you can provide separate methods on the display GF for an address book entry in colour or monochrome or, if you didn't need to worry about colour as you just displayed text in the default style, have a single implementation in terms of the generic "display context" supertype.

This feature is particularly welcome for people writing arithmetic libraries, who want to define multiplication between scalar and matrix, matrix and scalar, matrix and vector, vector and matrix, vector and scalar, scalar and vector, etc.

You can use run-time type information to implement all of this in a single-dispatch system, but (a) it's tedious typing (in both senses of the word) for the programmer, (b) it is not extensible (if somebody writes a "multiply" method in the "Matrix" class that knows to look for its argument being a scalar, vector, or other matrix, what is the author of a third-party "Quaternion" class to do to allow a Matrix to be multiplied by a Quaternion?), and (c) it robs the compiler of the opportunity to do the really fancy optimisations it can do when it knows that this is a polymorphic generic function dispatch.
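To make the extensibility point concrete, here is a toy multiple-dispatch table in Python. It is a sketch, not how any real generic function system is implemented: `defmethod`, `multiply`, `Vector` and `Quaternion` are names invented here, and the method lookup is a naive walk over both arguments' class hierarchies that favours specificity in the first argument.

```python
_METHODS = {}  # maps (type of a, type of b) to an implementation


def defmethod(*types):
    """Register the decorated function as the method for these argument types."""
    def register(fn):
        _METHODS[types] = fn
        return fn
    return register


def multiply(a, b):
    """The generic function: look up a method using the types of BOTH arguments."""
    for ta in type(a).__mro__:
        for tb in type(b).__mro__:
            fn = _METHODS.get((ta, tb))
            if fn is not None:
                return fn(a, b)
    raise TypeError(f"no multiply method for {type(a).__name__}, {type(b).__name__}")


class Vector:
    def __init__(self, *xs):
        self.xs = list(xs)


@defmethod(float, Vector)
def _scale(s, v):
    return Vector(*(s * x for x in v.xs))


@defmethod(Vector, float)
def _scale_flipped(v, s):
    return multiply(s, v)


@defmethod(Vector, Vector)
def _dot(a, b):
    return sum(x * y for x, y in zip(a.xs, b.xs))


# A third party can add a new type and new methods without touching Vector.
class Quaternion:
    def __init__(self, w, x, y, z):
        self.parts = (w, x, y, z)


@defmethod(float, Quaternion)
def _scale_quaternion(s, q):
    return Quaternion(*(s * p for p in q.parts))
```

The author of `Quaternion` only adds entries to the table; nobody has to edit a type switch buried inside another class's `multiply` method.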

However, generic functions present a big problem for me, as an aspiring functional programming language author: scoping.

Read more »

A draft specification for IRIDIUM (by )

As discussed in my previous post, I think it's lame that we use TCP for everything, and I think we could do much better! Here's my concrete proposal for IRIDIUM, a protocol that I think could be a great improvement:

Read more »

Configuring replication (by )

Storing all your data on one disk, or even inside one computer, is a risky thing to do. Anything stored in only one, small, physical location is all too easily destroyed by flood, fire, idiots, or deliberate action; and any one electronic device is prone to failure, as its continued functioning depends on the functioning of many tiny components that are not very easily replaced.

So it's sensible to store multiple copies, ideally in physically remote locations.

One way of doing this is by taking backups; this involves taking a copy of the data and putting it into a special storage system, such as compressed files on another disk, magnetic tape, a Ugarit vault, etc.

If the original data is lost, the backed-up data can't generally be used as-is, but has to be restored from the backup storage.

Another way is by replicating the data, which means storing multiple, equivalent, copies. Any of those copies can then be used to read the data, which is useful - there's no special restore process to get the data back, and if you have lots of requests to read the data, you can service those requests from your nearest copy of it (reducing delays and long-distance communication costs). Or you can spread the read workload across multiple copies in order to increase your total throughput.

Replication provides a better quality of service, but it has a downside; as all the copies are equally important, you can't use cheaper, slower, more compact storage methods for your extra copies, as you can with backups onto slower disks or tapes.

And then there are hybrid systems, perhaps where you have a primary copy and replicate onto slower disks as a "backup", while only using the primary copy for day-to-day use; if it fails then you switch to the slower "backup replica", and tolerate slower service until a new primary copy is made.

Traditionally, replicated storage systems such as HDFS require the administrator to specify a "replication factor", either system-wide or on a per-file basis. This is the number of replicas that must be made of the file. Two is the minimum to actually get any replication, but three is popular - if one replica is lost, then you still have two replicas to keep you going while you rebuild the missing replica, meaning you have to be unlucky and have two failures in quick succession before you're down to a single copy of anything.

However, this is a crude and nasty way of controlling replication. Needless to say, I've been considering how to configure replication of blocks within a Ugarit vault, and have designed a much fancier way.

For Ugarit replication, I want to cobble together all sorts of disks to make one large vault. I want to replicate data between disks to protect me against disk failures, and to make it possible to grow the vault by adding more disks, rather than having to transfer a single monolithic vault onto a larger disk when it gets full.

But as I'm a cheapskate, I'll be dealing with disks of varying reliability, capacity, and performance. So how do I control replication in such a complex, heterogeneous, environment?

What I've decided is to give each "shard" of the vault four configurable parameters.

The most interesting one is the "trust". This is a percentage. For a block to be considered sufficiently replicated, copies of it must exist on enough shards that the sum of the trusts of those shards is greater than or equal to 100%.

So a simple system with identical disks, where I want to replicate everything three times, can be had by giving each disk a trust of 34%; any three of them will sum to 102%, so every block will be copied three times.

But disks I trust less could be given a trust of 20%, requiring five copies if a block is stored only on such disks - or some combination of good and less-good disks.
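The trust rule is just a sum and a comparison. A minimal sketch, assuming trusts are held as whole-number percentages (the function name is made up):

```python
def is_sufficiently_replicated(shard_trusts, target=100):
    """True if the trusts of the shards holding a block add up to the target."""
    return sum(shard_trusts) >= target


# Three identical disks at 34% each: 102%, so three copies are enough.
three_good = is_sufficiently_replicated([34, 34, 34])
# Two good disks plus one less-trusted disk: 88%, so another copy is needed.
mixed_short = is_sufficiently_replicated([34, 34, 20])
# Adding a second less-trusted disk brings it to 108%.
mixed_ok = is_sufficiently_replicated([34, 34, 20, 20])
```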

That allows for simple homogeneous configurations, as well as complex heterogeneous ones, with a simple and intuitive configuration parameter. Nice!

The second is "write weighting". This is a dimensionless number, which defaults to 1 (it's not compulsory to specify it). Basically, when the system is given a block to store, it will pick shards at random until it has enough to meet the trust limit of 100%. But the write weighting is used as a weighting when making that random choice - a shard with a write weighting of 2 will get twice as many blocks written to it as a normal shard, on average.

So if I have two disks, one of which has 2TiB free and the other of which has 1TiB free, I can give a write weighting of 2 to the first one, and they'll fill so that they're both full at about the same time.

Of course, if I have disks that are now completely full in my vault, I can set their write weighting to 0 and they'll never be picked for writing new blocks to. They'll still be available for reading all the blocks they already have. If I left the write weighting untouched everything would still work, as the write requests failing would cause another shard to be picked for the write, but setting the weighting to 0 would speed things up by stopping the system from trying the write in the first place.
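Here is one way the write-side choice could look, as a sketch only: the `Shard` record and `pick_write_shards` are hypothetical names, shards are drawn without replacement, and shards with a weighting of 0 are filtered out before any write is attempted.

```python
import random
from dataclasses import dataclass


@dataclass
class Shard:
    name: str
    trust: int                 # percentage
    write_weight: float = 1.0  # 0 means "never pick for new writes"


def pick_write_shards(shards, target=100, rng=random):
    """Pick shards at random, weighted, until their trusts reach the target.

    Returns (chosen shards, whether the target was met).
    """
    candidates = [s for s in shards if s.write_weight > 0]
    chosen, trust = [], 0
    while trust < target and candidates:
        weights = [s.write_weight for s in candidates]
        pick = rng.choices(candidates, weights=weights)[0]
        candidates.remove(pick)  # a block is stored at most once per shard
        chosen.append(pick)
        trust += pick.trust
    return chosen, trust >= target
```

A shard with twice the weighting is twice as likely to be drawn at each step, which is what makes the 2TiB and 1TiB disks fill at roughly the same rate.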

The third parameter is a read priority, which is also optional and defaults to 1. When a block must be read, the list of shards it's replicated on is looked up, and a shard picked in read priority order. If there are multiple shards with the same read priority, then one is picked at random. If the read fails, we repeat the process (excluding already-tried shards), so the read priority can be used to make sure we consult a fast, nearby, cheap-to-access local disk before trying to use a remote shard, for instance.

By default, all shards have the same read priority, so read requests will be randomly spread across them, sharing the load.

Finally, we have a read weighting, which defaults to 1. When we randomly pick a shard to read from, out of a set of alternatives with the same priority, we weight the random choice with this weighting. So if we have a disk that's twice as fast as another, we can give it twice the weighting, and on a busy system it'll get twice as many reads as the other, spreading the load fairly.
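The read side combines both parameters: priority picks the tier, weighting picks within the tier. A sketch under a few assumptions: a larger priority number means "try first", read weightings are positive, and `try_read` is a stand-in callback that returns the block's data or `None` on failure.

```python
import random
from dataclasses import dataclass


@dataclass
class Shard:
    name: str
    read_priority: int = 1
    read_weight: float = 1.0


def read_order(shards, rng=random):
    """Yield shards in the order a read would try them."""
    remaining = list(shards)
    while remaining:
        top = max(s.read_priority for s in remaining)
        tier = [s for s in remaining if s.read_priority == top]
        # Within one priority tier, choose at random using the read weighting.
        pick = rng.choices(tier, weights=[s.read_weight for s in tier])[0]
        remaining.remove(pick)  # never retry a shard that has been tried
        yield pick


def read_block(shards, try_read, rng=random):
    """Try each shard holding the block until one read succeeds."""
    for shard in read_order(shards, rng):
        data = try_read(shard)
        if data is not None:
            return data
    raise IOError("block could not be read from any shard")
```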

I like this approach, since it can be dumbed down to giving defaults for everything - 34% trust (for three-way replication, since three copies at 33% would only sum to 99%), and all the weightings and priorities at 1 (to spread everything evenly).

Or you can fine-tune it based on details of your available storage shards.

Or you can use extreme values for various special cases.

Got a "memcached backend" that offers fast storage, but will forget things? Give it a 0% trust and a high write weighting, so everything gets written there, but also gets properly replicated to stable storage; and give it a high read priority, so it gets checked first. Et voila, it's working as a cache.

Got 100% reliable storage shards, and just want to "stripe" them together to create a single, larger, one? Give them 100% trust, so every block is only written to one, but use read/write weightings to distribute load between them.

Got a read-only shard, perhaps due to its disk being full, or because you've explicitly copied it onto some protected read-only media (eg, optical) for security reasons? Just set the write weighting to 0, and it'll be there for reading.

Got some crazy combination of the above? Go for it!

Also, systems such as HDFS let you specify the replication factor on a per-file basis, requiring more replication for more important files (increasing the number of shard failures required to totally lose them) and to make them more widely available in the cluster (increasing the total read throughput available on that file, useful for small-but-widely-required files such as configuration or reference data). We can do that too! By default, every block written needs to be replicated enough to attain 100% trust - but this could be overridden on a per-block basis. Indeed, you could store a block on every shard by setting a trust target of "infinity"; normally, when given a trust target it can't meet (even with every shard), the system would do its best and emit a warning that the system is in danger, but a trust target of "infinity" should probably suppress that warning as it can be taken to mean "every shard".

The trust target of a block should be stored along with it, because the system needs to be able to check that blocks are still sufficiently replicated when shards are removed (or lost), and replicate them to new shards until every block has met its trust target again.
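The check that runs when a shard disappears might look something like this. The data layout is an assumption made for the sketch: `blocks` maps a block ID to its stored trust target and the set of shard names holding it, and `live_shards` maps the name of each surviving shard to its trust percentage. A target of infinity is read as "must be on every live shard".

```python
import math


def under_replicated(blocks, live_shards):
    """Return the IDs of blocks that no longer meet their trust target."""
    needy = []
    for block_id, (target, holders) in blocks.items():
        live_holders = holders & set(live_shards)
        if target == math.inf:
            # "Infinity" means "every shard", so compare against all live shards.
            ok = live_holders == set(live_shards)
        else:
            ok = sum(live_shards[name] for name in live_holders) >= target
        if not ok:
            needy.append(block_id)
    return needy
```

Each block this returns would then be copied to further shards until its target is met again.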

Tell me what you think. I designed this for Ugarit's replicated storage backend and WOLFRAM replicated storage in ARGON, but I think it could be a useful replication control framework in other projects, too.

The only extension I'm considering is having a write priority as well as a write weighting, just as we do with reads - because that would be a better way of enforcing all writes go to a "fast local cache" backend than just giving it a weighting of 99999999 or something, but I'm not sure it's necessary and four numbers is already a lot. What do you think?

A user interface design for a scrolling log viewer with varying levels of importance (by )

Like many people involved with computer programming and systems administration, I spend a lot of time looking at rapidly scrolling logs.

These logs tend to have lines of varying importance in them. This falls into two kinds, as I see it - one is where the lines have a "severity" (ranging from fatal errors down to debugging information). Another is where there's an explicit structure, with headings and subheadings.

Both suffer from a shared problem: important events or top-level headings whoosh past amidst a stream of minutiae, and can be missed. A fatal error message can be obscured by thousands of routine notifications.

What I think might help is a tool that can be shoved in a pipe when viewing such a log, that uses some means (regexps, etc) to classify log lines with a numerical "importance" as appropriate, and then relays them to the output.

However, it will use terminal control sequences to:

  1. Colour the lines according to their importance
  2. Ensure that the most recent entry at each level of importance remains onscreen, unless superseded by a later entry with a higher importance.

The latter deserves some explanation.

To start with, if we just have two levels of importance - ERROR and WARNING, for instance - it means that in a stream of output, as an ERROR scrolls up the screen, when it gets to the top it will "stick" and not scroll off, even while WARNINGs scroll by beneath it.

If a new ERROR appears at the bottom of the screen, it supersedes the old one, which can now disappear - letting the new ERROR scroll up until it hits the top and sticks.

Likewise, if you have three levels - ERROR, WARNING and INFO - then the most recent ERROR and WARNING will be stuck at the top of the screen (the WARNING below the ERROR) while INFOs scroll by. If a new WARNING appears, then the old one will unstick and scroll away until the new WARNING hits the top. If a new ERROR appears, then the old ERROR and WARNING at the top will become unstuck and scroll away until the new ERROR reaches the top.

So the screen is divided into two areas; the stuck things at the top, and the scrolling area at the bottom. Messages always scroll up through the scrolling area as they come, but any message that scrolls off the top will stick in the stuck things area unless there's another message at the same or higher level further down the scrolling area. And the emergence of a message into the bottom of the scrolling area automatically unsticks any message at that, or a less important, level from the stuck area.
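The sticking rules can be modelled without any terminal handling at all. In this sketch the class and its names are invented, importance is an integer where bigger means more important, and the scrolling area has a fixed height separate from the stuck area (a real tool would have to share one screen between the two).

```python
class StickyLog:
    """Toy model of the stuck area and the scrolling area."""

    def __init__(self, scroll_height):
        self.scroll_height = scroll_height
        self.stuck = []   # (importance, text) pairs, most important first
        self.scroll = []  # (importance, text) pairs, oldest first

    def add(self, importance, text):
        # A new message unsticks anything at its own level or less important.
        self.stuck = [m for m in self.stuck if m[0] > importance]
        self.scroll.append((importance, text))
        if len(self.scroll) > self.scroll_height:
            top = self.scroll.pop(0)
            # Stick only if nothing further down is at the same or a higher level.
            if all(level < top[0] for level, _ in self.scroll):
                self.stuck.append(top)


ERROR, WARNING, INFO = 3, 2, 1
log = StickyLog(scroll_height=2)
for level, text in [(ERROR, "e1"), (INFO, "i1"), (INFO, "i2"),
                    (WARNING, "w1"), (INFO, "i3"), (INFO, "i4")]:
    log.add(level, text)
# At this point log.stuck holds e1 with w1 beneath it, while i3 and i4 scroll.
```

Adding a new ERROR after this empties the stuck area straight away, and that ERROR sticks in turn once it reaches the top of the scrolling area.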

That way, you can quickly look at the screen and see a scrolling status display, as well as (for activity logs from servers) the most recent FATAL, ERROR, WARNING, etc. message; or for the kinds of logs generated by long-running batch jobs, which tend to have lots of headings and subheadings, you'll always instantly see the headings/subheadings in effect for the log items you're reading.

This is related somewhat to the idea of having ERRORs and WARNINGs be situations with a beginning and an end (rather than just logged when they arise), such as "being low on disk space"; such a "situation alert" (rather than an event alert, as a single log message is) should linger on-screen somewhere until it's cancelled by the software that raised it emitting a corresponding "situation is over" event. Also related is the idea that event alerts above a certain severity should cause some kind of beeping/flashing to happen, which persists until manually stopped by pushing a button to acknowledge all current alerts. Such facilities can be integrated into the system.

This is relevant for a HYDROGEN console UI and pertinent to my previous thoughts on user interfaces for streams of events and programming interfaces to logging systems.

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales