Category: ARGON

A draft specification for IRIDIUM

As discussed in my previous post, I think it's lame that we use TCP for everything, and I think we could do much better! Here's my concrete proposal for IRIDIUM, a protocol that I think could be a great improvement:


Configuring replication

Storing all your data on one disk, or even inside one computer, is a risky thing to do. Anything stored in only one, small, physical location is all too easily destroyed by flood, fire, idiots, or deliberate action; and any one electronic device is prone to failure, as its continued functioning depends on the functioning of many tiny components that are not very easily replaced.

So it's sensible to store multiple copies, ideally in physically remote locations.

One way of doing this is by taking backups; this involves taking a copy of the data and putting it into a special storage system, such as compressed files on another disk, magnetic tape, a Ugarit vault, etc.

If the original data is lost, the backed-up data can't generally be used as-is, but has to be restored from the backup storage.

Another way is by replicating the data, which means storing multiple, equivalent, copies. Any of those copies can then be used to read the data, which is useful - there's no special restore process to get the data back, and if you have lots of requests to read the data, you can service those requests from your nearest copy of it (reducing delays and long-distance communication costs). Or you can spread the read workload across multiple copies in order to increase your total throughput.

Replication provides a better quality of service, but it has a downside; as all the copies are equally important, you can't use cheaper, slower, more compact storage methods for your extra copies, as you can with backups onto slower disks or tapes.

And then there are hybrid systems, perhaps where you have a primary copy and replicate onto slower disks as a "backup", while only using the primary copy for day-to-day use; if it fails then you switch to the slower "backup replica", and tolerate slower service until a new primary copy is made.

Traditionally, replicated storage systems such as HDFS require the administrator to specify a "replication factor", either system-wide or on a per-file basis. This is the number of replicas that must be made of the file. Two is the minimum to actually get any replication, but three is popular - if one replica is lost, then you still have two replicas to keep you going while you rebuild the missing replica, meaning you have to be unlucky and have two failures in quick succession before you're down to a single copy of anything.

However, this is a crude and nasty way of controlling replication. Needless to say, I've been considering how to configure replication of blocks within a Ugarit vault, and have designed a much fancier way.

For Ugarit replication, I want to cobble together all sorts of disks to make one large vault. I want to replicate data between disks to protect me against disk failures, and to make it possible to grow the vault by adding more disks, rather than having to transfer a single monolithic vault onto a larger disk when it gets full.

But as I'm a cheapskate, I'll be dealing with disks of varying reliability, capacity, and performance. So how do I control replication in such a complex, heterogeneous, environment?

What I've decided is to give each "shard" of the vault four configurable parameters.

The most interesting one is the "trust". This is a percentage. For a block to be considered sufficiently replicated, copies of it must exist on enough shards that the sum of the trusts of those shards is at least 100%.

So a simple system with identical disks, where I want to replicate everything three times, can be had by giving each disk a trust of 34%; any three of them will sum to 102%, so every block will be copied three times.

But disks I trust less could be given a trust of 20%, requiring five copies if a block is stored only on such disks - or some combination of good and less-good disks.

That allows for simple homogeneous configurations, as well as complex heterogeneous ones, with a simple and intuitive configuration parameter. Nice!
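The trust rule is simple enough to sketch in a few lines of Python. This is only an illustration under my own assumptions: the `Shard` class and the function name are made up for the example, and trusts are held as whole-number percentages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    name: str
    trust: int  # percentage: 34 means 34%


def is_sufficiently_replicated(holders, target=100):
    """True when the trusts of the shards holding a block sum to at least the target."""
    return sum(shard.trust for shard in holders) >= target


# Hypothetical shards: three disks I trust at 34%, five I trust at only 20%.
good = [Shard(f"good{i}", 34) for i in range(3)]
flaky = [Shard(f"flaky{i}", 20) for i in range(5)]

print(is_sufficiently_replicated(good[:2]))             # two good disks: 68%
print(is_sufficiently_replicated(good))                 # three good disks: 102%
print(is_sufficiently_replicated(good[:2] + flaky[:2])) # mixed: 108%
```

Two good disks alone fall short, three are enough, and a mix of good and less-good disks works as long as the total reaches the target.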

The second is "write weighting". This is a dimensionless number, which defaults to 1 (it's not compulsory to specify it). Basically, when the system is given a block to store, it will pick shards at random until it has enough to meet the trust limit of 100%. But the write weighting is used as a weighting when making that random choice - a shard with a write weighting of 2 will get twice as many blocks written to it as a normal shard, on average.

So if I have two disks, one of which has 2TiB free and the other of which has 1TiB free, I can give a write weighting of 2 to the first one, and they'll fill so that they're both full at about the same time.

Of course, if I have disks that are now completely full in my vault, I can set their write weighting to 0 and they'll never be picked for writing new blocks to. They'll still be available for reading all the blocks they already have. If I left the write weighting untouched everything would still work, as the write requests failing would cause another shard to be picked for the write, but setting the weighting to 0 would speed things up by stopping the system from trying the write in the first place.
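Here's a rough Python sketch of how that write-time choice might look. It's an assumption-laden illustration rather than anything Ugarit actually does: the `Shard` class and `pick_write_shards` are hypothetical names, shards with a weighting of 0 are simply skipped, and failed writes aren't modelled.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    name: str
    trust: int                 # percentage
    write_weight: float = 1.0  # 0 means "never pick me for new blocks"


def pick_write_shards(shards, target=100, rng=random):
    """Pick shards at random, biased by write weighting, until the trust target is met.

    Returns (chosen_shards, target_met).
    """
    candidates = [s for s in shards if s.write_weight > 0]
    chosen, total = [], 0
    while total < target and candidates:
        weights = [s.write_weight for s in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        candidates.remove(pick)  # a block is stored at most once per shard
        chosen.append(pick)
        total += pick.trust
    return chosen, total >= target


# Hypothetical vault: one roomy disk, two ordinary ones, and one that is full.
vault = [
    Shard("big", 34, write_weight=2.0),
    Shard("a", 34),
    Shard("b", 34),
    Shard("full", 34, write_weight=0.0),
]
chosen, ok = pick_write_shards(vault)
print([s.name for s in chosen], ok)
```

The second return value is false when even every writable shard together can't reach the target, which is the point at which the system would want to complain.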

The third parameter is a read priority, which is also optional and defaults to 1. When a block must be read, the list of shards it's replicated on is looked up, and a shard picked in read priority order. If there are multiple shards with the same read priority, then one is picked at random. If the read fails, we repeat the process (excluding already-tried shards), so the read priority can be used to make sure we consult a fast, nearby, cheap-to-access local disk before trying to use a remote shard, for instance.

By default, all shards have the same read priority, so read requests will be randomly spread across them, sharing the load.

Finally, we have a read weighting, which defaults to 1. When we randomly pick a shard to read from, out of a set of alternatives with the same priority, we weight the random choice with this weighting. So if we have a disk that's twice as fast as another, we can give it twice the weighting, and on a busy system it'll get twice as many reads as the other, spreading the load fairly.
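The read side can be sketched the same way. Again this is just an illustration with made-up names (`Shard`, `read_order`, `read_block`, and the `try_read` callback are all hypothetical). I'm assuming a higher read priority means "try me first", and that read weightings are positive.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    name: str
    read_priority: int = 1
    read_weight: float = 1.0  # assumed to be greater than zero


def read_order(holders, rng=random):
    """Return the shards holding a block in the order they would be tried."""
    remaining = list(holders)
    order = []
    while remaining:
        top = max(s.read_priority for s in remaining)
        tier = [s for s in remaining if s.read_priority == top]
        # Within a priority tier, the choice is random but weighted.
        pick = rng.choices(tier, weights=[s.read_weight for s in tier], k=1)[0]
        remaining.remove(pick)
        order.append(pick)
    return order


def read_block(holders, try_read, rng=random):
    """Try each shard in turn; try_read(shard) returns the data, or None on failure."""
    for shard in read_order(holders, rng):
        data = try_read(shard)
        if data is not None:
            return data
    return None


# Hypothetical setup: a local disk is consulted before two remote shards.
local = Shard("local", read_priority=2)
remote1 = Shard("remote1")
remote2 = Shard("remote2", read_weight=2.0)
print([s.name for s in read_order([remote1, local, remote2])])
```

The local shard always comes first; the two remote shards follow in a random order, with `remote2` tried before `remote1` about twice as often.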

I like this approach, since it can be dumbed down to giving defaults for everything - 34% trust (for a three-way replication, since three lots of 33% would only reach 99%), and all the weightings and priorities at 1 (to spread everything evenly).

Or you can fine-tune it based on details of your available storage shards.

Or you can use extreme values for various special cases.

Got a "memcached backend" that offers fast storage, but will forget things? Give it a 0% trust and a high write weighting, so everything gets written there, but also gets properly replicated to stable storage; and give it a high read priority, so it gets checked first. Et voila, it's working as a cache.

Got 100% reliable storage shards, and just want to "stripe" them together to create a single, larger, one? Give them 100% trust, so every block is only written to one, but use read/write weightings to distribute load between them.

Got a read-only shard, perhaps due to its disk being full, or because you've explicitly copied it onto some protected read-only media (eg, optical) for security reasons? Just set the write weighting to 0, and it'll be there for reading.

Got some crazy combination of the above? Go for it!
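To make those special cases concrete, here's how they might look written down as configuration, sketched as plain Python data. The shard names, the exact numbers (1000, 10, and the optical disk's 34% trust), and the `effective` helper are all invented for illustration; there's no real configuration format behind this.

```python
# Defaults for the three optional parameters; trust must always be given here.
DEFAULTS = {"write_weight": 1, "read_priority": 1, "read_weight": 1}

# Hypothetical example shards, one per special case described above.
EXAMPLE_SHARDS = {
    # Forgetful cache: counts for nothing towards replication, but is
    # written to often and checked first on reads.
    "memcached": {"trust": 0, "write_weight": 1000, "read_priority": 10},
    # Fully trusted shards "striped" together: one copy of each block is enough.
    "stripe-a": {"trust": 100},
    "stripe-b": {"trust": 100},
    # Read-only media: still readable, never picked for new blocks.
    "optical": {"trust": 34, "write_weight": 0},
}


def effective(name):
    """Return a shard's settings with the defaults filled in."""
    return {**DEFAULTS, **EXAMPLE_SHARDS[name]}


for name in EXAMPLE_SHARDS:
    print(name, effective(name))
```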

Also, systems such as HDFS let you specify the replication factor on a per-file basis, requiring more replication for more important files (increasing the number of shard failures required to totally lose them) and to make them more widely available in the cluster (increasing the total read throughput available on that file, useful for small-but-widely-required files such as configuration or reference data). We can do that too! By default, every block written needs to be replicated enough to attain 100% trust - but this could be overridden on a per-block basis. Indeed, you could store a block on every shard by setting a trust target of "infinity"; normally, when given a trust target it can't meet (even with every shard), the system would do its best and emit a warning that the system is in danger, but a trust target of "infinity" should probably suppress that warning as it can be taken to mean "every shard".

The trust target of a block should be stored along with it, because the system needs to be able to check that blocks are still sufficiently replicated when shards are removed (or lost), and replicate them to new shards until every block has met its trust target again.
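That check could look something like the following Python sketch. The data layout is an assumption made purely for the example: a map from block to the shards holding it, a map of per-block trust targets, and a map from each surviving shard to its trust. A target of infinity is read as "must be on every surviving shard".

```python
import math


def under_replicated(block_holders, block_targets, shard_trust, default_target=100):
    """Return the ids of blocks whose surviving copies no longer meet their trust target.

    block_holders: block id -> list of shard names holding a copy
    block_targets: block id -> trust target (missing means default_target)
    shard_trust:   shard name -> trust percentage, for shards still in the vault
    """
    needy = []
    for block_id, holders in block_holders.items():
        target = block_targets.get(block_id, default_target)
        live = [h for h in holders if h in shard_trust]
        if target == math.inf:
            # "Infinity" is taken to mean a copy on every shard.
            satisfied = set(live) == set(shard_trust)
        else:
            satisfied = sum(shard_trust[h] for h in live) >= target
        if not satisfied:
            needy.append(block_id)
    return needy


# Hypothetical vault state.
trust = {"a": 34, "b": 34, "c": 34, "d": 20}
holders = {
    "blk1": ["a", "b", "c"],
    "blk2": ["a", "b", "d"],
    "blk3": ["a", "b", "c", "d"],
}
targets = {"blk3": math.inf}

print(under_replicated(holders, targets, trust))
```

Running the same check after dropping a shard from `trust` shows which blocks need copying to new shards to get back up to their targets.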

Tell me what you think. I designed this for Ugarit's replicated storage backend and WOLFRAM replicated storage in ARGON, but I think it could be a useful replication control framework in other projects, too.

The only extension I'm considering is having a write priority as well as a write weighting, just as we do with reads - because that would be a better way of enforcing all writes go to a "fast local cache" backend than just giving it a weighting of 99999999 or something, but I'm not sure it's necessary and four numbers is already a lot. What do you think?

A user interface design for a scrolling log viewer with varying levels of importance

Like many people involved with computer programming and systems administration, I spend a lot of time looking at rapidly scrolling logs.

These logs tend to have lines of varying importance in them. This falls into two kinds, as far as I can see: one is where the lines have a "severity" (ranging from fatal errors down to debugging information); the other is where there's an explicit structure, with headings and subheadings.

Both suffer from a shared problem: important events or top-level headings whoosh past amidst a stream of minutiae, and can be missed. A fatal error message can be obscured by thousands of routine notifications.

What I think might help is a tool that can be shoved in a pipe when viewing such a log, that uses some means (regexps, etc) to classify log lines with a numerical "importance" as appropriate, and then relays them to the output.

However, it will use terminal control sequences to:

  1. Colour the lines according to their importance
  2. Ensure that the most recent entry at each level of importance remains onscreen, unless superseded by a later entry with a higher importance.

The latter deserves some explanation.

To start with, if we just have two levels of importance - ERROR and WARNING, for instance - it means that in a stream of output, as an ERROR scrolls up the screen, when it gets to the top it will "stick" and not scroll off, even while WARNINGs scroll by beneath it.

If a new ERROR appears at the bottom of the screen, it supersedes the old one, which can now disappear - letting the new ERROR scroll up until it hits the top and sticks.

Likewise, if you have three levels - ERROR, WARNING and INFO - then the most recent ERROR and WARNING will be stuck at the top of the screen (the WARNING below the ERROR) while INFOs scroll by. If a new WARNING appears, then the old one will unstick and scroll away until the new WARNING hits the top. If a new ERROR appears, then the old ERROR and WARNING at the top will become unstuck and scroll away until the new ERROR reaches the top.

So the screen is divided into two areas; the stuck things at the top, and the scrolling area at the bottom. Messages always scroll up through the scrolling area as they come, but any message that scrolls off the top will stick in the stuck things area unless there's another message at the same or higher level further down the scrolling area. And the emergence of a message into the bottom of the scrolling area automatically unsticks any message at that, or a less important, level from the stuck area.
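The sticking rule can be modelled without any terminal handling at all. The Python sketch below is a simplified model under my own assumptions: `StickyLog` is a made-up name, importance is a number where bigger means more important, the scrolling area has a fixed height regardless of how many lines are stuck, and unstuck lines vanish at once rather than scrolling away.

```python
from collections import deque


class StickyLog:
    """Model of a log view with a stuck area above a fixed-height scrolling area."""

    def __init__(self, scroll_height):
        self.scroll_height = scroll_height
        self.stuck = []           # (importance, text), most important first
        self.scrolling = deque()  # (importance, text), oldest first

    def add(self, importance, text):
        # A new arrival unsticks anything at its own level or below.
        self.stuck = [m for m in self.stuck if m[0] > importance]
        self.scrolling.append((importance, text))
        while len(self.scrolling) > self.scroll_height:
            old = self.scrolling.popleft()
            # A line leaving the top sticks, unless a message of the same or
            # higher importance is still further down the scrolling area.
            if not any(m[0] >= old[0] for m in self.scrolling):
                self.stuck.append(old)

    def screen(self):
        """Lines as they would appear, stuck area first."""
        return [t for _, t in self.stuck] + [t for _, t in self.scrolling]


ERROR, WARNING, INFO = 3, 2, 1

log = StickyLog(scroll_height=3)
log.add(ERROR, "E1")
for n in (1, 2, 3):
    log.add(INFO, f"i{n}")
print(log.screen())  # E1 has left the scrolling area but stays stuck at the top
```

A real tool would classify each incoming line (with regexps, say) to get the importance number, and would redraw the screen with terminal control sequences after each `add`; both are left out here.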

That way, you can quickly look at the screen and see a scrolling status display, as well as (for activity logs from servers) the most recent FATAL, ERROR, WARNING, etc. message; or for the kinds of logs generated by long-running batch jobs, which tend to have lots of headings and subheadings, you'll always instantly see the headings/subheadings in effect for the log items you're reading.

This is related somewhat to the idea of having ERRORs and WARNINGs be situations with a beginning and an end (rather than just logged when they arise), such as "being low on disk space"; such a "situation alert" (rather than an event alert, as a single log message is) should linger on-screen somewhere until it's cancelled by the software that raised it emitting a corresponding "situation is over" event. Also related is the idea that event alerts above a certain severity should cause some kind of beeping/flashing to happen, which persists until manually stopped by pushing a button to acknowledge all current alerts. Such facilities can be integrated into the system.

This is relevant for a HYDROGEN console UI and pertinent to my previous thoughts on user interfaces for streams of events and programming interfaces to logging systems.

Thoughts on Programming and Tracing

I was recently pointed at this interesting article: Learnable Programming.

It's a good read, overturning many assumptions the software industry has picked up over the years, and propagated without thought since.

The first part suggests allowing a programmer to trace the flow of execution of a program graphically, using an interactive timeline. My first thought was that this was all well and good, but would rely on every library in the language annotating every operation with information about how to present it - producing the little thumbnails to go in the timeline, or exposing numeric values that can be plotted onto charts. Also, highlighting the "current" drawing operation in red on the canvas relies on those operations being things that affect a canvas; more abstract operations, such as writing to a database (or even generating images to be encoded directly into a file rather than onto the screen) would require a more explicit "object preview".

However, those are not insurmountable goals. And, perhaps, things that can be built on top of my ideas about logging and tracing, making it possible to use such an interface to go through traces of execution captured from production servers, rather than just within a cute live-coding IDE; the trace entries generated by operations in your libraries could, with the help of a meta-library of trace visualisation rules, generate those little thumbnails. However, it would need to be augmented with dynamic scope information provided by the programming environment itself to know which line of code caused the trace event; the kind of thing one finds in a stack trace.

He asks "Another example. Most programs today manipulate abstract data structures and opaque objects, not pictures. How can we visualize the state of these programs?"; so I suggest that the abstract data structures and opaque objects be annotated with code that summarises their state. Many languages have a notion of "return a string representation of this object", generally aimed at debug logging - Python's repr() versus str(), for instance. Perhaps if we moved to expecting objects to return HTML representations of themselves, we could take a step in that direction.
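As a small illustration of what that might look like in Python: alongside the usual `__repr__`, an object can offer an HTML summary of its state. The `Account` class is invented for the example. The method name `_repr_html_` is the one IPython's rich display looks for; nothing else here depends on IPython, and any tracing tool could call the method directly.

```python
from html import escape


class Account:
    """A made-up opaque object that can describe its own state."""

    def __init__(self, owner, balance):
        self.owner = owner
        self.balance = balance

    def __repr__(self):
        # The traditional debug-logging string representation.
        return f"Account(owner={self.owner!r}, balance={self.balance!r})"

    def _repr_html_(self):
        # An HTML summary of the same state; values are escaped so that
        # arbitrary data cannot inject markup into the viewer.
        return (
            "<table>"
            f"<tr><th>Owner</th><td>{escape(str(self.owner))}</td></tr>"
            f"<tr><th>Balance</th><td>{escape(str(self.balance))}</td></tr>"
            "</table>"
        )


acct = Account("Ada <admin>", 42)
print(repr(acct))
print(acct._repr_html_())
```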

The second part (and I'm taking some temporal liberties here, as some concepts I've included in the first part are touched upon in the second and vice versa) is also inspiring; it looks at the bigger picture, considering how libraries and code-editing environments can be designed to make it much easier for programmers to identify what operations their libraries are making available to them, rather than requiring the first step to be the reading of documentation. It touches on topics such as the dangers of mutable state (preaching to the converted here!), and the choice of library function names to make code using them clear (I'm also a big fan of Smalltalk / Cocoa-style function call syntax, and how it might be brought into the Lisp family of languages...)

I've written before that I think modifying software should be a much more widely-practiced activity; and I think that should be achieved through removing unnecessary obstacles, rather than forcing everyone through complicated programming classes. I'm always interested in more thoughts on how to make that happen!

Insomnia

There's something about the combination of having spent many weeks in a row without more than the odd half-hour here and there to myself (time when I get to do whatever I like, rather than merely choosing which of the list of things I need to get done urgently I will do next, or just having no choice at all), and knowing I need to get up even earlier the next morning than usual (to dive straight into a long day of scheduled activities), that makes it very, very, hard for me to sleep.

So, although I got to bed in good time for somebody who has to wake up at six o'clock, I have given up lying there staring at the ceiling, and come down to eat some more food (I get the munchies past midnight), read my book without disturbing Sarah with my bedside light, and potter on my laptop. I need to be up in five hours, so hopefully emptying my brain of whirling thoughts will enable me to sleep.

There's lots of things I want to do. Even though it's something I need to get done by a deadline, I'm actually enthusiastic about continuing the project I was working on today; making an enclosure for our chickens. This is necessary for us to be able to go away from the house for more than one night, which is something we want to do over Christmas; thus the deadline.

Three of the edges of the enclosure will be built onto existing walls or woodwork, but one of them needs to cut across some ground, so I've dug a trench across said bit of ground, laid an old concrete lintel and some concrete blocks in the trench after levelling the base with ballast, and then mixed and rammed concrete around them. When I next get to work on it, I'll mix up a large batch of concrete and use it to level the surface neatly (and then ram any left-overs into remaining gaps) to just below the level of the soil, then lay a row of engineering bricks (frog down) on a mortar bed on top of that in order to make a foundation that I can screw a wooden batten to. With that done, and some battens screwed into the tops of existing walls that don't already have woodwork on, I'll be able to build the frame of the enclosure (including a door), then attach fox-proof mesh to it, and our chickens will have a new home they can run around in safely.

Thinking about how I'm going to lay the next batch of concrete in a nice level run, working around the fact that I only have a short spirit level by placing a long piece of wood in there and levelling it with wedges and then using it as a reference to level the concrete to, has been one of the things running around in my head this evening.

Another has been the next steps from last Friday, when I had a fascinating meeting with a bunch of interesting people in the information security world. You see, I've always been interested in the foundation technologies upon which we build software, such as storage management, distributed computing, parallel computing, programming languages, operating systems, standard libraries, fault tolerance, and security. I was lucky enough to find a way into the world of database development a few years ago, which (with a move to a company that produces software to run SQL queries across a cluster) has broadened to cover storage management, distribution, parallelism, AND programming languages. So imagine my delight when said company starts to develop the security features in the product, and I can get involved in that; and even more when (through old contacts) I'm invited to the inaugural meeting of a prestigious group of people interested in security. That landed me an invite to the second meeting (chaired by an actual Lord, and held in the House of Lords!), the highlight of which was of course getting to talk to the participants after the presentations. I found out about the Global Identity Foundation, who are working on standardising the kind of pseudonymous identity framework I have previously pined for; I'm going to see if I can find a way to get more involved in that. But I need to do a lot of reading-up on the organisations and people involved in this stuff, and figure out how I can contribute to it with my time and money restrictions.

I'd really like to have some quiet time to work on my secret fiction project, too. And I want to investigate Ugarit bugs. Some bugs in the Chicken Scheme system have been found and fixed lately, so I need to re-test all these bugs to see if any of the more mysterious ones were artefacts of that. I'm in a bit of a vicious circle with that; the longer it is since I've been tinkering with the Ugarit internals, the longer it'll take me to get back into it, and the more nervous I feel about doing so. I think I might need to pick off some lighter bit of work with good rewards (adding a new feature, say) and handle that first, to get back into the swing of things. Either way, I'll need a good solid day to dig into it all again; trying to assemble that from sporadic hours just won't cut it.

I'm still mulling over issues in the design of ARGON. Right now I'm reading a book on handling updates to logical databases - adding new facts to them, and handling the conflicts when the new facts contradict older ones, in order to produce a new state of the database where the new fact is now true, but no contradictions remain. I need to work this out to settle on a final semantics for CARBON, which will be required to implement distributed storage of knowledge within TUNGSTEN. I need a semantics that can converge towards a consensus on the final state of the system, despite interruptions in internal network connectivity within the cluster causing updates to arrive in different orders in different places; doing that efficiently is, well, easier said than done.

I really want to finish rebuilding my furnace, which I hoped to get done this Summer, but I'm still assembling the structural supports for it. I've made a mould to cast shaped refractory bricks for the lining of the furnace, but I've yet to mix up the heatproof insulating material the bricks need to be made out of and start casting the bricks, as I still need to work out how I'll form the tuyere.

I want to get Ethernet cabled to my workshop, because currently I don't have a proper place for working on my laptop; I have to do it on the sofa in the lounge to be within range of the wifi, which isn't very ergonomic, doesn't give me access to my external screens, and is prone to interruption by children. I find it very motivating to be in "my space", too; the computer desk in the workshop is all set up the way I like it. And just for fun, I'd like to rig the workshop with computer-controlled sensors and gizmos (that kind of thing is a childhood dream of mine...).

This past year, I've tried booking two weekend days a month for my projects, in our shared calendar. This worked well at the start of the year, with projects such as the workshop ladder and eaves proceeding well, but it started to falter around the Summer when we got really busy with festivals and the like. I started having to fit half-days in around other things, which meant spending too much time getting started and clearing up compared to actually getting things done, so my morale faltered; and with so much other stuff on, I've been increasingly inclined to spend my free time just relaxing rather than getting anything done. On a couple of occasions I've tried taking a week off work to pursue my projects, but I then feel guilty about it and start allocating days to spending more time with the children or tidying the house, and before I know it, five days off becomes one day of actual project work. I need to stop feeling guilty about taking time to do the things I enjoy, because if I don't, I'll be too tired and miserable to do a good job of the things I should be doing! And rather than booking my monthly project days around other stuff that's going on, next year I'm going to mark out my two days each month in advance, and then move them elsewhere in the month if Sarah needs me to do something on that particular day, to decrease the chance of ending up having to scrape together half-days around the month (or to skip project days entirely, as I ended up doing last month). I feel awful about saying I'm going to spend days doing what I feel like doing rather than the things the rest of my family need me to drive them to, but if I don't, I think I'm going to fall apart!

Now... off and on I've spent forty minutes writing this blog post. So with my whirling thoughts dumped out, I'm going to go back to bed and see if I can sleep this time around. Wish me luck!


Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales