Category: ARGON

A draft specification for IRIDIUM

As discussed in my previous post, I think it's lame that we use TCP for everything, and that we could do much better! Here's my concrete proposal for IRIDIUM, a protocol that I think could be a great improvement:

Underlying Transport

IRIDIUM assumes an underlying transport that can send datagrams up to some fixed size (which may not be known) to a specified address:port pair (where address and port can be anything, no particular representation is assumed); sent datagrams can be assigned a delivery priority (0-15, with 0 being the most important to deliver first) and a drop priority (0-15, with 0 being the most important to not drop), but the transport is free to ignore those. They're just hints.

You can also ask it to allocate a persistent source port (the transport gets to choose, not you), or to free one up when it's no longer needed, and to use that when sending datagrams rather than some arbitrary source port. The actual source port is not revealed by the underlying transport, nor is the source address, as it may not be known - Network Address Translation may be operating between the source and destination. All that matters is that it is consistent, and different to any other allocated source port from the same address.

The transport can also be asked to listen on a specified port or address:port pair, and will return incoming datagrams that show up there. Replies to sent datagrams will also be returned, marked as either a reply to a datagram with no specific source port, or indicating which allocated source port the reply was received at (without necessarily revealing what the actual port is). Incoming datagrams may be provided with a "Congestion Notification" flag to warn that congestion is occurring within the network, and will all be marked with the source address and port they came from so that replies can be sent back.

The transport can also return errors where datagram sending failed; errors will specify the source and destination address and port of the datagram that caused the error.

(This pretty much describes UDP, when datagrams are sent with the Don't Fragment flag set and ECN enabled, and ICMP; but it could also be implemented as a raw IP protocol alongside UDP and TCP, or over serial links of various kinds, as will be described later).

The transport can also provide a "name resolution" mechanism, which converts an arbitrary string into an address:port pair.
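
To make that contract concrete, here's a minimal sketch of what the underlying transport interface might look like in Python. All the names here are mine, not part of the spec, and the address/port types are deliberately opaque:

    from abc import ABC, abstractmethod
    from typing import Callable, Optional, Tuple

    Address = Tuple[object, object]  # opaque address and port; no representation assumed

    class UnderlyingTransport(ABC):
        @abstractmethod
        def send(self, dest: Address, payload: bytes,
                 delivery_priority: int = 0, drop_priority: int = 0,
                 source_port: Optional[object] = None) -> None:
            """Send one datagram; the 0-15 priorities are hints the transport
            may ignore. source_port is an opaque handle from
            allocate_source_port(), or None for an arbitrary source."""

        @abstractmethod
        def allocate_source_port(self) -> object:
            """The transport picks a persistent source port and returns an
            opaque handle; the actual port number is never revealed."""

        @abstractmethod
        def free_source_port(self, handle: object) -> None:
            """Release a previously allocated source port."""

        @abstractmethod
        def listen(self, port: object,
                   on_datagram: Callable[[Address, bytes, bool], None]) -> None:
            """Deliver incoming datagrams as on_datagram(source, payload,
            congestion_notification)."""

        @abstractmethod
        def resolve(self, name: str) -> Address:
            """Name resolution: arbitrary string to an address:port pair."""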

Application-visible semantics

  1. Given an address:port combination, you can:
    1. Send a message of arbitrary size, with a drop priority (0-15) and a delivery priority (0-15). If the drop priority is 0, then the caller will be notified (eventually) of the message having arrived safely, or not. If the drop priority is non-zero, then the caller might be notified if the message is known not to arrive for any reason.
    2. Send a request message of arbitrary size, with a delivery priority (0-15). The caller will be notified, eventually, of non-delivery or delivery, and if delivery is confirmed, it will at some point be given a response message, which can be of arbitrary size. The response message may arrive in fragments.
    3. Request a connection, including a request message of arbitrary size, with a delivery priority (0-15). The caller will be notified, eventually, of non-delivery or delivery, and if delivery is confirmed, it will at some point be given a response message of arbitrary size and a connection context.
  2. Set up a listener on a requested port. Given a listener, you can:
    1. Be notified of incoming messages, tagged with the address:port they came from. They may arrive in fragments.
    2. Be notified of incoming request messages (which may arrive in fragments), tagged with the address:port they came from, and once the request has fully arrived, be able to reply with a response message of arbitrary size.
    3. Be notified of incoming connection requests (with a request message, that may arrive in fragments), tagged with the address:port they came from, and be able to reply with a response message of arbitrary size to yield a connection context, which you can immediately close if you don't want the connection.
    4. Close it.
  3. Given a connection context, you can:
    1. Do any of the things you can do with an address:port combination (optionally requesting ordered delivery, as long as the drop priority of any message is 0 and the delivery priority of anything sent ordered is always consistent).
    2. Do any of the things you can do with a listener (including closing it).
    3. Be notified of the other end closing the connection.
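
In code form, a hypothetical shape for that application-facing API might be the following Python sketch; none of these names or signatures are prescribed by the draft, and I've assumed a callback style for brevity:

    from typing import Callable

    class Peer:
        """Wraps an address:port combination."""
        def send_message(self, data: bytes, drop_priority: int,
                         delivery_priority: int,
                         on_outcome: Callable[[bool], None]) -> None: ...
        def send_request(self, data: bytes, delivery_priority: int,
                         on_response: Callable[[bytes], None]) -> None: ...
        def open_connection(self, data: bytes, delivery_priority: int,
                            on_open: Callable[[bytes, "Connection"], None]) -> None: ...

    class Listener:
        def on_message(self, handler: Callable[[object, bytes], None]) -> None: ...
        def on_request(self, handler) -> None: ...      # handler returns the response bytes
        def on_connection(self, handler) -> None: ...   # handler may keep or close the connection
        def close(self) -> None: ...

    class Connection(Peer, Listener):
        """A connection context can do everything a Peer and a Listener can,
        plus ordered sends (drop priority 0, consistent delivery priority)."""
        def on_peer_close(self, handler: Callable[[], None]) -> None: ...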

Datagram format

  1. Flag byte:
    1. Top 4 bits: protocol version (0000 binary).
    2. Lower 4 bits: datagram type (values defined below).
  2. 56-bit ID
  3. Type of Service (ToS) byte:
    1. Top 4 bits: drop priority.
    2. Lower 4 bits: delivery priority.
  4. Flags byte. From most significant to least significant bits:
    1. Control chunks are present.
    2. Ordered delivery (connection only).
    3. Rest unused (must be set to 0).
  5. If the "control chunks are present" flag is set, then we have one or more control chunks of this form:
    1. Control chunk type byte:
      1. Top bit set means another control chunk follows; unset means this is the last.
      2. Bottom seven bits are the control chunk type (values defined below).
    2. Control chunk contents (depending on the control chunk type):
      1. Type 0000000: Acknowledge
        1. 56-bit ID.
        2. 1 bit flag to indicate that the request being responded to arrived with the Congestion Notification flag set.
        3. 31 bits, encoding an unsigned integer: the number of microseconds between the request datagram arriving and the response datagram being sent.
      2. Type 0000001: Reject datagram
        1. 56-bit ID.
        2. Error code (8 bits).
        3. Optional error data, depending on error code:
          1. Type 00000000: CRC failure.
          2. Type 00000001: Drop priority was nonzero on a datagram other than type 0001 or 0110.
          3. Type 00000010: Spans were contiguous, overlapping, or extended beyond the datagram end in a type 1100 datagram.
          4. Type 00000011: Delivery priority differed between different ordered-delivery datagrams on the same connection.
          5. Type 00000100: Unknown datagram type code. Body is the type code in question, padded to 8 bits.
          6. Type 00000101: Unknown control chunk type. Body is the type code in question, padded to 8 bits.
          7. Type 00000110: Datagram arrived truncated at the recipient. Body is the length it was truncated at, as a 32-bit unsigned integer.
          8. Type 00000111: Datagram dropped due to overload at the recipient.
          9. Type 00001000: Temporary redirect. Body contains an address:port in the correct format for the underlying transport; caller should retry the datagram with the new destination.
          10. Type 00001001: connection ID not known for this source address:port / destination address:port combination.
          11. Type 00001010: Response received to unknown request ID.
          12. Type 00001011: Large datagram fragment is too small.
          13. Type 00001100: Large datagram begin rejected, it's too big and we don't have enough space to store it!
          14. Type 00001101: Large datagram fragment received, but we don't recognise the ID.
          15. Type 00001110: Ordered delivery flag was set on a datagram other than a connection message, request, or sub-connection open, or with a drop priority other than zero.
      3. Type 0000010: Acknowledge connection datagram
        1. 56-bit connection ID
        2. 56-bit datagram ID
        3. Acknowledgement data:
          1. Most significant bit is set to indicate that the request being responded to arrived with the Congestion Notification flag set.
          2. 31 bits, encoding an unsigned integer: the number of microseconds between the request datagram arriving and the response datagram being sent.
      4. Type 0000011: Reject connection datagram
        1. 56-bit connection ID
        2. 56-bit datagram ID
        3. Error code (8 bits).
        4. Error data, depending on error code, as per Type 0000001.
  6. Then follows the datagram contents (depending on the datagram type), all the way up to the end of the datagram minus 32 bits:
    1. Type 0000: Control chunks only - no contents.
    2. Type 0001: Message: Application data.
    3. Type 0010: Request: Application data.
    4. Type 0011: Response:
      1. Embedded ACK:
        1. Most significant bit is set to indicate that the request being responded to arrived with the Congestion Notification flag set.
        2. 31 bits, encoding an unsigned integer: the number of microseconds between the request datagram arriving and the response datagram being sent.
      2. Application data.
    5. Type 0100: Connection Open: Application data.
    6. Type 0101: Connection Open Response:
      1. Embedded ACK (see above).
      2. Application data.
    7. Type 0110: Connection Message:
      1. 56-bit Connection ID.
      2. Application data.
    8. Type 0111: Connection Request:
      1. 56-bit Connection ID.
      2. Application data.
    9. Type 1000: Connection Response:
      1. 56-bit connection ID.
      2. Embedded ACK (see above).
      3. Application data.
    10. Type 1001: Sub-Connection Open:
      1. 56-bit connection ID.
      2. Application data.
    11. Type 1010: Large datagram begin
      1. 4-bit inner datagram type (values marked with * are not valid)
      2. 4-bit size-of-size (unsigned integer). Eg, 0000 means a 16-bit size, 0001 means a 32-bit size, 0010 means a 64-bit size. Values above that are probably unneeded on current hardware.
      3. 16 * 2 ^ size-of-size bits of size (unsigned integer), the size of the inner datagram in bytes.
      4. Some prefix of the inner datagram, at least half the estimated maximum packet size.
    12. Type 1011: Large datagram fragment
      1. 16 * 2 ^ size-of-size bits of offset (unsigned integer), the byte offset in the inner datagram that this fragment begins at.
      2. Some fragment of the inner datagram, at least half the estimated maximum packet size.
    13. Type 1100: Large datagram ack
      1. 16 * 2 ^ size-of-size bits of offset (unsigned integer) of the last "Large datagram fragment" received (or all 0s if the last received fragment of this large datagram was the prefix in the "Large datagram begin")
      2. 1 bit: if set, the list of missing spans in the body of the datagram is complete. If not set, more will be sent later.
      3. 31 bits, encoding an unsigned integer: the number of microseconds between the latest fragment datagram arriving and this acknowledgement being sent.
      4. 16 bits (unsigned integer) for the number of fragments received since the last update.
      5. 16 bits (unsigned integer) for how many of those arrived with the Congestion Notification flag set.
      6. (Repeated until end of datagram contents):
        1. 16 * 2 ^ size-of-size bits of offset (unsigned integer) of a section of the large datagram that has not yet been received.
        2. 16 * 2 ^ size-of-size bits of length (unsigned integer) of that section.
        3. (The offsets must increase from start to end of the datagram, and no sections may be contiguous or overlapping).
    14. Type 1101: Connection close. Contents is a 56-bit connection ID.
    15. Type 1110: Large datagram cancel. No contents.
  7. CRC32 of all of the previous bytes.
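
As a sanity check of that layout, here's how the fixed header fields and the trailing CRC might be encoded (a Python sketch; the byte order of the 56-bit ID is my assumption, as the draft doesn't state it):

    import struct
    import zlib

    def encode_header(dgram_type: int, dgram_id: int, drop_priority: int,
                      delivery_priority: int, has_chunks: bool, ordered: bool) -> bytes:
        flag_byte = (0 << 4) | (dgram_type & 0x0F)   # version 0000, then type
        tos = ((drop_priority & 0x0F) << 4) | (delivery_priority & 0x0F)
        flags = (0x80 if has_chunks else 0) | (0x40 if ordered else 0)
        return bytes([flag_byte]) + dgram_id.to_bytes(7, "big") + bytes([tos, flags])

    def finish_datagram(header_and_contents: bytes) -> bytes:
        """Append the CRC32 of all preceding bytes."""
        return header_and_contents + struct.pack(">I", zlib.crc32(header_and_contents))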

What the heck?

That's a lot to take in in one sitting, so I'm now going to go through what the IRIDIUM implementation needs to send in various situations.

Sending a small message to a host

Firstly, pick a random 56-bit ID. What you use for randomness is up to you, but it should be hard to predict in order to make spoofing tricky.

Wrap the message up in a type 0001 datagram, with your chosen delivery and drop priorities in the header, and send it off. If the drop priority was zero, then wait a while for a type 0000000 (acknowledge) control chunk to come back acknowledging it (the ID in the control chunk will match the ID we sent). If one doesn't come in a reasonable timeframe, send the datagram again, with the same ID.

If you get back a type 0000001 (rejection) control chunk, if the datagram was truncated, try using the large message process (and using the truncation point as an estimate of the path MTU). For any other rejection type, something probably got corrupted, so try again. Give up if you keep retrying for too long.

If the underlying transport returns a path MTU error, try using the large message process with the advised MTU as the path MTU estimate. If the underlying transport returned some other error, retry as before.
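
Put together, the sending side for a zero-drop-priority message might look like this sketch (Python; the transmit and wait primitives are injected placeholders, and the retry count and timeout are arbitrary values the draft leaves open):

    import os

    MAX_RETRIES = 5   # arbitrary; the draft just says give up "if you keep retrying for too long"

    def new_datagram_id() -> int:
        return int.from_bytes(os.urandom(7), "big")   # hard-to-predict 56-bit ID

    def send_small_message(send, wait_for_outcome, datagram) -> str:
        """send(datagram) transmits; wait_for_outcome(timeout) returns "ack",
        "reject-truncated", "reject-other", or None if nothing came back."""
        for _ in range(MAX_RETRIES):
            send(datagram)
            outcome = wait_for_outcome(timeout=1.0)   # "a while", tuned from the RTT estimate
            if outcome == "ack":
                return "delivered"
            if outcome == "reject-truncated":
                return "use-large-flow"   # retry via the large message process
            # other rejection (likely corruption) or silence: retry with the same ID
        return "gave-up"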

Receiving a small message

If you get a datagram with type 0001, you've received a message!

If the datagram is valid, and the ID isn't in your list of recently seen message IDs (otherwise, just ignore it), you can pass it to your application. If it has a zero drop priority, the sender is expecting an acknowledgement, so craft a type 0000000 control chunk acknowledging it. You can send it in any datagram you were going to send to that address:port anyway, or create a type 0000 datagram just for it if there's no other traffic headed that way.

If the datagram is invalid, craft a type 0000001 (rejection) control chunk explaining why. Again, you can embed it in an existing datagram or create a type 0000 datagram just for it.
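
The receive side is mostly a dedup-and-acknowledge decision, roughly like this (a Python sketch; the "recently seen" set would need expiry in a real implementation, and the helper names are mine):

    def handle_incoming_message(dgram_id: int, drop_priority: int, payload: bytes,
                                seen_ids: set, deliver, queue_control_chunk) -> None:
        """deliver(payload) hands the message to the application;
        queue_control_chunk(kind, dgram_id) buffers a chunk to piggyback
        on outgoing traffic (or ride in a type 0000 datagram)."""
        if dgram_id in seen_ids:
            return                                         # duplicate: quietly ignore
        seen_ids.add(dgram_id)
        deliver(payload)
        if drop_priority == 0:
            queue_control_chunk("acknowledge", dgram_id)   # type 0000000 chunk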

Sending a small request to a host

This starts off the same as sending a message: you pick an ID, and send your request in a datagram of type 0010 with the chosen delivery priority (drop priority isn't supported for requests).

As with the message case, keep retrying until you receive an acknowledgement. However, you may also directly receive a type 0011 datagram containing a response, which has an implicit acknowledgement inside it.

When you receive a type 0011 response datagram with the same ID as your request, send an acknowledgement in a type 0000000 control chunk.

Receiving a small request

If you get a datagram with type 0010, you've received a request!

If the datagram is invalid, craft a type 0000001 (rejection) control chunk explaining why and send it back.

If the datagram is valid, and the ID isn't in your list of recently seen request IDs (otherwise, just ignore it), you can pass it to your application and await a response.

If you don't get a response from the application within a reasonable timeframe, craft a type 0000000 control chunk to acknowledge the request and send it back.

When you get a response from the application, if it looks like it'll fit into a single datagram and be no larger than twice the size of the request datagram (to avoid amplification attacks), craft a type 0011 datagram with the same ID as the request and send the response back. As with sending a small message, you'll need to wait for acknowledgement, and retransmit the response if you don't get it or if you get a rejection back that looks like it was corruption (don't bother retransmitting if you get a type 00001010 error code, indicating the request ID wasn't recognised).

If your response looks too big for a datagram, or you got a rejection saying it was truncated or the underlying transport rejects it saying it was too big, you need to follow the "sending a large response" flow.
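
The decision of whether a response can go back as a single type 0011 datagram reduces to a check like this (sketch; the 2x factor is from the text above, and the MTU estimate is whatever the flow-control layer currently believes):

    def response_fits_single_datagram(response_size: int,
                                      request_datagram_size: int,
                                      mtu_estimate: int) -> bool:
        """A single-datagram response must fit the path and be at most twice
        the request's size, to limit amplification attacks."""
        return (response_size <= mtu_estimate
                and response_size <= 2 * request_datagram_size)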

Sending a large message, request, or response

If you want to send a message, request, or response that you know won't fit in a single datagram, or you tried it and got rejected for being too big or arriving truncated, you need to send it as a large datagram.

Form an estimate of the largest datagram you can send, using whatever information you have to hand. Take a guess.

Send a type 1010 datagram containing the details of the large datagram and as many of its initial bytes as you think you can fit. If this is a response to a single-datagram request outside of a connection, do not send an initial datagram more than twice the size of the request datagram (to mitigate amplification attacks).

If the type 1010 datagram gets acknowledged in a type 0000000 control chunk, start sending type 1011 datagrams with further sections of as many bytes as you think you can fit, in order; if it gets rejected by a type 0000001 control chunk, don't.

When the recipient starts sending back type 1100 "large datagram ack" datagrams, use that to update your knowledge of what fragments of the large datagram have been received. Any fragment that has been sent and not acknowledged within a reasonable timeframe should be retransmitted. If you never get any acknowledgements after a while, give up.

If you receive evidence that the datagrams are too large for the link, reduce the size of datagram you are sending. If you receive a rejection with error code 00001101, abort sending. If you want to give up sending, send a type 1110 large datagram cancel datagram with the same ID.

When the recipient has sent a type 1100 "large datagram ack" that is marked as complete and indicates that no fragments are missing, you can stop.
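
A skeleton of the sender's side of this flow (Python sketch; all the transmit/ack primitives are injected placeholders, and the fragment size, timeout, and round limit are arbitrary):

    MAX_ROUNDS = 20   # arbitrary give-up point

    def send_large(send_begin, wait_begin_ack, send_fragment, next_ack,
                   payload: bytes, frag_size: int) -> bool:
        """send_begin/send_fragment transmit type 1010/1011 datagrams;
        wait_begin_ack() returns True if the begin was acknowledged;
        next_ack(timeout) returns (complete, missing_spans) parsed from a
        type 1100 ack, or None if nothing arrived in time."""
        send_begin(len(payload), payload[:frag_size])      # prefix rides along
        if not wait_begin_ack():
            return False                                   # rejected: send no fragments
        rest = len(payload) - frag_size
        missing = [(frag_size, rest)] if rest > 0 else []
        for _ in range(MAX_ROUNDS):
            for offset, length in missing:                 # (re)send missing spans
                for at in range(offset, offset + length, frag_size):
                    end = min(at + frag_size, offset + length)
                    send_fragment(at, payload[at:end])     # type 1011 datagrams
            ack = next_ack(timeout=1.0)
            if ack is not None:
                complete, missing = ack
                if complete and not missing:
                    return True                            # receiver has everything
        return False                                       # no acks for too long: give up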

Receiving a large message, request, response, connection open, or connection open response.

If you receive a type 1010 datagram, you're getting a large datagram! You can reject it with error code 00001011 if it could easily have fitted into a normal datagram. Anyway, look at the ID and the type code and decide if you want to accept it or reject it (eg, if it's a response to an ID you don't have an outstanding request for, it's bogus). You might even examine the prefix of the body, which is included in the request, to make that decision.

If you want to reject it, send a suitable rejection control chunk. If you want to accept it, send an acknowledgement control chunk and pass the total length and prefix to the application as an incoming large datagram. Keep track of what sections of the large datagram you're missing, which will initially be everything except the prefix.

Any type 1010 or 1011 datagram that is too small can be rejected with a control chunk rejection code of 00001011. Most underlying transports have a minimum MTU, and datagrams smaller than that didn't need to be split into fragments, so anybody doing so is probably just trying to waste your resources. The exception, of course, being the final fragment of the datagram!

If you receive a type 1011 datagram with an ID that doesn't match a current large datagram receive in progress, reject it with a control chunk rejection code of 00001101.

All parts of a large datagram should come from the same source address:port and to the same destination port. If any don't, reject them with error code 00001101.

As type 1011 datagrams with the same ID flood in, pass them on to the application along with their fragment offsets, and subtract them from the list of missing sections - unless you already had that fragment, in which case ignore it. There may be an overlap, in which case, only send the new data. For instance, suppose the sender sent two 16KiB fragments, only the second of which arrived, but the sender mistakenly thought that it was sending too-large datagrams and tried again, sending 10KiB fragments: the first is entirely novel and can be sent to the application; the second overlaps the second 16KiB fragment received originally, so the application should be sent only the first, missing, 6KiB; and the third falls entirely inside the initially-received second 16KiB fragment, so can be ignored.

At reasonable intervals, send a type 1100 datagram containing the offset of the most recently received fragment, how long since you received it, the count of fragments received since the last type 1100 datagram, how many of them had congestion warnings, and the current list of missing spans. If the whole list won't fit in a datagram, just cut it short and don't set the "this list is complete" bit; as we send the list in order, we will be omitting spans towards the end of the large datagram, and as the sender sends fragments in order, it is less likely to care about them yet anyway.

If you receive nothing for a while, give up. If you receive a type 1110 large datagram cancel datagram with the same ID, give up. If you want to give up for your own reasons, send a control chunk rejection with the ID of the large datagram and error code 00001101 indicating that you no longer recognise the ID.

When you have all of the large datagram, send a type 1100 datagram confirming that you are missing no parts (and make sure to mark it as complete information), to tell the sender they can finish.
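
The receiver's span bookkeeping, including the overlapping-retransmission case from the 16KiB example above, comes down to subtracting each fragment from the missing list (a Python sketch):

    def apply_fragment(missing, frag_offset: int, frag_len: int):
        """Subtract a fragment from the missing (offset, length) spans.
        Returns (still_missing, novel_spans): only novel bytes go to the
        application, so duplicates and overlaps are filtered out here."""
        frag_end = frag_offset + frag_len
        still_missing, novel = [], []
        for offset, length in missing:
            end = offset + length
            lo, hi = max(offset, frag_offset), min(end, frag_end)
            if lo >= hi:                        # no overlap: span unchanged
                still_missing.append((offset, length))
                continue
            novel.append((lo, hi - lo))         # genuinely new bytes
            if offset < lo:                     # leading part still missing
                still_missing.append((offset, lo - offset))
            if hi < end:                        # trailing part still missing
                still_missing.append((hi, end - hi))
        return still_missing, novel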

Opening a top-level connection

Every connection should have a dedicated source port (and consistent source address), so ask the underlying transport to allocate you a consistent outgoing port and use it for all traffic pertaining to that top-level connection and all its subconnections. Any datagram pertaining to a particular connection that arrives on the wrong port should be rejected with an error code of 00001001.

To request opening a connection, pick an ID for the connection and send a type 0100 (connection open) datagram containing the initial application data accompanying the request. If it's too large for one datagram, use the large datagram process above.

Much as with a request, you'll either get back a type 0101 (connection open response) datagram with the response, or a control chunk of type 0000000 acknowledging it and then a later type 0101 response - or maybe a control chunk of type 0000001 rejecting it. If you hear nothing within a reasonable timeframe, retry. The type 0101 datagram may be a large one, in which case you'll get it embedded inside type 1010 and 1011 datagrams, as usual. Report this back to the application, and keep track of that connection ID.

Closing a connection

Send a type 1101 datagram with the connection ID inside it and its own unique ID. It must have the "ordered delivery" flag set if you want to ensure that any ordered datagrams sent are actually delivered, as it could arrive before them; or omit it if you just want the connection closed quickly. As usual, if you don't get an ACK, retry for a reasonable timeframe.

Answering a top-level connection request

If you receive a type 0100 datagram (possibly embedded in a large datagram), somebody wants to open a connection to you. Keep track of the source address:port, we'll need it later!

Pass the embedded request data to the application and await its response. If you don't get one soon, send an acknowledgement control chunk to stop the sender retrying. When you get a response, reply with a type 0101 (connection open response) datagram containing the response (again, embedding it in a large datagram if it's too big).

If the application doesn't really want the connection, it will express that somehow in its response, and immediately close the connection afterwards.

Sending unordered messages, requests, responses, or connection closes within a connection

These follow exactly the same processes as above, except that we use different datagram types: type 0110 for a message, type 0111 for a request, and type 1000 for a response (connection closes already use their own type, 1101). We also use type 0000010 control chunks to acknowledge datagrams and type 0000011 to reject them; all of these are the same as their normal counterparts except that they include an additional 56-bit connection ID.

Receiving unordered messages, requests, responses, or connection closes within a connection

These follow the same processes as above, except we use the different datagram types, and find the connection ID inside the extended datagram formats in order to know which application-level connection to report the datagram to. We also check that the incoming source address and port matches what we expect for the connection ID.

Opening sub-connections within a connection

Again, we follow the same process as opening a top-level connection, except we use a datagram of type 1001, including the parent connection ID. We use the normal connection accept/reject datagram/control chunk types for the rest of the protocol; the link between the connection and its parent is established only during the open process.

Sub-connections use the same source port as the top-level connection.

Answering a sub-connection request

This is just like answering a top-level connection request, but with a type 1001 datagram initiating the process, indicating a parent connection ID which we use to tell the application what connection this is a sub-connection of.

Sending ordered messages, requests, or sub-connection opens within a connection

Ordered messages must have a drop priority of zero, and the delivery priority of all ordered datagrams must be consistent for the lifetime of a connection - whatever delivery priority we first use, all subsequent ones must be the same. You will receive a rejection with an error code of 00000011 for changing the delivery priority during a connection, or 00001110 for violating the static invariants.

To send ordered datagrams, follow the usual process for that type of datagram, but set the "ordered delivery" flag - and ensure that the ID of the datagram is one larger than the ID of the previous ordered datagram within that connection (with wraparound from 2^56-1 to 0). To keep that ID "available", the process of randomly selecting IDs for datagrams within a connection should avoid ones within some safety margin above the current last-ordered-datagram-ID. Of course, it's fine to re-use datagram IDs "after a while" (see below), and 56 bits is a large space for randomly-chosen values to not collide in, so this should be easy to ensure.

Note that the sequentially increasing ID constraint is independent for each direction of the connection.
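
The ID discipline for ordered datagrams, and the safety margin for random IDs within a connection, might look like this (Python sketch; the margin size is my guess, as the draft only says "some safety margin"):

    import os

    ID_SPACE = 1 << 56
    SAFETY_MARGIN = 1 << 20   # hypothetical width of the reserved window

    def next_ordered_id(last_ordered_id: int) -> int:
        return (last_ordered_id + 1) % ID_SPACE   # wraps from 2^56-1 to 0

    def pick_random_id(last_ordered_id: int) -> int:
        """Random IDs within a connection steer clear of the window just
        above the ordered-datagram counter."""
        while True:
            candidate = int.from_bytes(os.urandom(7), "big")
            if (candidate - last_ordered_id) % ID_SPACE > SAFETY_MARGIN:
                return candidate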

Notes

Check source addresses!

As I have mentioned at various points above, but not exhaustively, we can be very picky that the source address and port of a datagram matches what we expect:

  1. A response should come from the address:port we sent the request to.
  2. Any control chunks pertaining to a datagram ID should come from the address:port we sent that datagram to originally.
  3. All fragments, cancel datagrams, and control frames pertaining to a large datagram should come from the same address:port.
  4. All datagrams pertaining to a top-level connection and all its subconnections should come from the same address:port.

This is a measure to make it harder to spoof datagrams to interfere with others' communications - but if a 56-bit random ID is sufficient for that alone, we could relax all those restrictions. This would have the following benefits:

  1. Less storing of expected source ports in protocol state, saving memory.
  2. Less checking of source ports against expected values, saving time.
  3. Most importantly, if a mobile device changes IP address due to switching networks, its in-progress communications will succeed. Responses/acks/rejections/etc sent to its old address will be lost into the void, but that will be handled like any other lost packet - the mobile device will retry sending things from its new IP, the recipient will receive them, and subsequent replies to those new datagrams will be routed to its new home.

There is one problem, though: the state for a connection, or an in-progress large datagram send in either direction, MUST include the address:port of the other end, so that new datagrams can be sent there. That is set when the connection or large-datagram starts, but at what point do we update it if we start receiving datagrams for that connection from a different address:port?

If an attacker guesses the connection ID, they could send a single datagram using that connection ID to "hijack" the connection, by having their source address:port set as the new far end of the connection so they receive all subsequent traffic; or an attacker could send a request that triggers a large datagram response and then spoof a packet with the same request ID from a victim host to bombard them with large-datagram fragment traffic.

The cryptographic layer I propose adding on top would fix that (see below) for connections, but more careful thought is required before enabling this. Perhaps when a change of source address:port is detected, communication on that connection/large datagram send should be paused and a probe datagram of some kind sent, containing a random value which must be returned (so a spoofing attacker can't fake the acknowledgement), before traffic resumes?

Note that this needs to happen for both ends of a large datagram transmission - sender or recipient could migrate. And, of course, once opened, connections are symmetrical - so with time, both ends of a connection could migrate to different addresses and ports, multiple times!

Picking IDs

Every datagram has an ID. They are not necessarily unique - a request and a response will have the same ID to tie them together, and the start, fragments, acknowledgement, and cancellation of a large datagram will all have the same ID. In fact, if a request is large and the response is large, then the same ID will be used for the process of sending the request in fragments and then for the process of sending the response back in fragments; these two processes can't overlap in time, so there is no ambiguity.

However, other than those allowed forms of re-use of the same ID because they pertain to related datagrams, IDs must be unique within the context of (for non-connection traffic) the source address:port and destination port, and (for connection traffic) the source address:port, destination port, and connection ID. Receivers will keep track of the IDs which further datagrams are expected for in any given context (eg, the response to a request that was sent out, or an in-progress large datagram transfer) so that incoming datagrams can be routed to the appropriate process, and they will also keep track of "recently used" IDs when those IDs are no longer expected again, so that any duplicated datagrams can be quietly rejected.

The "recently used" IDs should be kept for a reasonable time period - long enough for any lurking datagrams trapped in the network to have drained out. It is suggested that the process of picking a new ID be random (with the exception of ordered datagrams in a connection, although the initial ordered datagram ID in each direction in a connection should be random). This makes it harder for third parties in the network to forge rejection or close datagrams and mess with our communications. A sender could, for some extra robustness, check that a randomly allocated ID is not present in the list of current or recently used IDs in that context, as well as ensuring that a new ID is not too close to the sequential ID counter of the enclosing connection (if any) to risk a possible collision.

Analysis notes

Control chunks are never retransmitted; no record is made of which control chunks were carried with a datagram, so if a datagram is retransmitted, then it will just carry whatever control chunks were waiting at the time of the retransmit. Type 0000 datagrams are never acknowledged and hence never get retransmitted, nor are type 0001 or type 0110 (message and connection message, respectively) datagrams that have a nonzero drop priority, but all others are (although large datagram fragments have their own custom retransmit mechanism rather than per-datagram acknowledgements).

TODO

To make that a proper protocol specification, I need to:

  1. Write each process as a proper state machine.
  2. Define the exact response to every different kind of error code.
  3. Clarify all the validity checks, and how to respond to them with an error, and what to do next.

A few improvements I've already thought of are:

  1. Having duplicate datagram types for connection or connectionless versions of things, differing only in having a connection ID added, is perhaps over-complicated. Consider using one of those unused flag byte bits to say "Is it a connection datagram?", and if so, add a connection ID - only on datagram types that make sense inside a connection.
  2. When a request has been sent and the application at the receiving end is taking a while to respond, an acknowledgement control chunk is sent back and then the caller waits forever for a reply. If the server disappears, the caller has no way of knowing. Also, the caller has no way of requesting that the server abort if the caller realises it doesn't need the response. The server end should be made to send additional acknowledgement control chunks at regular intervals to "keep-alive" the request, so the client can retry or give up if they stop appearing. Also, define a chunk type to cancel a request, sent with the same ID as the request.
  3. The same applies for an open connection - if you've not sent anything on that connection within some reasonable timeframe, send a "keep-alive". If the far end hasn't sent you one within a reasonable timeframe, consider the connection closed by a failure (send the application an error rather than a normal close).
  4. Make the acknowledgement control chunk include the datagram type that it's acknowledging as well as the ID, so datagram senders can differentiate acknowledgement of a request datagram and acknowledgement of the cancellation of that request, the latter so it can stop retransmitting the cancel.
  5. Writing all those type codes in binary and then referring to them in the text by their binary numbers sucks. Use hexadecimal, and refer to the type names in the text rather than the numbers.
  6. I should clarify the existence of a "transaction", that being any operation which requires some synchronisation between two peers. Sending a single-datagram message and awaiting an ACK is a transaction; sending a request and awaiting a response and ACKing the response is a transaction; sending a large message in fragments is a transaction; any connection is a transaction (with sub-transactions for any of the above happening over that connection, or sub-connections), etc. Making these clearer makes it easier to analyse the protocol. The sections on sending/receiving each side of the transaction should probably be merged together to clarify this.
  7. I have boldly spoken of underlying transport errors being detected in various cases in the spec, but of course we don't know exactly what datagram ID an error is returned by - just the source and destination address and port. This probably isn't a problem as the errors received generally apply to the host as a whole being unreachable or requiring smaller datagrams or the destination port not listening at all, which are broad-scope things that apply to all transactions in progress to that address or address:port pair, but the spec needs to clarify this.
  8. To avoid connectionless responses being used as an amplification attack (send a small datagram to an innocent server requesting a large response, but with a spoofed source address pointing at your victim, to flood them with unwanted traffic), I've mandated that the response datagram must be no bigger than twice the size of the request, or it goes into the large-datagram mode, which sends an initial datagram (again, no larger than twice the size of the original) then awaits an ACK before sending the rest. Might it be worth including optional padding (which must be zeroed) in a connectionless response, so that senders can increase the allowance for single-datagram responses, in effect performing a "bandwidth proof-of-work" by using some of their outgoing bandwidth? I'll need to do the maths on the cost of the padding vs. the cost of extra round-trips and headers to send a small response as a multipart datagram.
  9. I need to go through every datagram sent in the protocol and check that:
    1. The effect of it going missing isn't catastrophic, eg something is retransmitted until the implementation gives up trying.
    2. The effect of it being duplicated isn't catastrophic, eg the deduplication logic can make sense of it.
  10. Should I include any useful functionality at the lower layers, that the normal IP layer fails to provide for us in a helpful manner? The section about source address validation already opens the question of embedding a connection mobility mechanism. The protocol is already designed so that an implementation connecting to a well-known public address and port should get bidirectional communication even if they are behind NAT, by carefully not caring about the source address of datagrams other than to route replies back; a mobility mechanism will make this more robust by allowing recovery from loss of NAT state in the network, treating it as a migration of the connection to a new source address:port. But should we embed STUN support?
  11. To prevent large datagram responses being used as an amplification attack, I need to include a random value in the large datagram begin which must be returned in a special acknowledgement control chunk. Otherwise, an attacker can send a spoofed request followed by a spoofed large datagram begin acknowledgement control chunk, from the victim address.

How to implement this

The implementation of the above splits neatly into layers, which should make implementations easy to reason about. Of course the layers might get all smushed together in practice for performance reasons or something, but in theory at least, they can be isolated relatively cleanly.

UDPv4 underlying transport

The underlying transport semantics have a fairly obvious mapping to UDPv4. The UDPv4 transport needs to request ECN on outgoing packets by default, but it also needs to maintain some per-peer state (using the peer cache defined in the next layer) to detect non-ECN-capable transports and stop requesting ECN where it seems to be causing problems. See section 4 of RFC6679.

For name resolution, we should allow the caller to specify a service name and default port as well as the name. Then if we are passed a DNS hostname, we can attempt to look up the service via SRV records, or if that fails, look up an A record and use the default port. Passing in a raw IP address and skipping resolution (using the default port, or a specified port for IP:PORT strings), or a domain name and a port and skipping SRV lookup, should all be supported. This should work out of the box with mDNS .local hostname resolution if the underlying resolver supports it, of course, and DNS-SD is explicitly out of scope.

Flow control, retransmission, datagram encoding

The lowest level of the stack is the flow control engine, whose job is to send datagrams to and from the underlying transport, handling retransmission and flow control.

The packet formats and protocol above don't define how flow control happens, but they do provide the tools to do it; the actual flow control algorithm is up to the implementation, and may change without changing the protocol specification.

The implementation maintains a cache of known peers, based on the address only (not port). For each peer, we store its address, an estimated background packet loss rate, an estimated available bandwidth to that peer in bytes/tick (the tick is some arbitrary unit of time, perhaps 100ms), a leaky bucket counter in bytes, an estimated maximum datagram size (known as MTU), an estimated round-trip time (RTT), an observed datagram send rate in packets/second, an observed datagram retransmission rate in datagrams/second, an observed datagram congestion-notification rate in datagrams/second, an observed outgoing bandwidth usage in bytes/second, and a last-used timestamp. The peer cache also contains some data for use by the underlying transport for that peer (identified by the address type, if multiple underlying transports are in use), the format of which is opaque to this layer and is merely provided to the underlying transport on every operation.

I've hand-wavingly referred to "reasonable" timeframes for giving up on receiving a datagram or control chunk, in the protocol specification, and the estimated RTT to a peer is used to tune this, plus some allowance for processing time at the destination - which also places an upper bound on how long an implementation can buffer control chunks for before sending them, and how long to wait for a response from the application before sending an explicit acknowledgement control chunk. I need to pick a reasonable processing timeframe and write it in the specification, as both sender and recipient need to agree on it (but allowing it to be negotiated in the protocol itself might allow for DoS attacks, by setting it unreasonably low to put a burden on the other end to respond quickly, or unreasonably high to disallow the peer from giving up quickly and throwing away state about things you never respond to).

When a request comes in to send to a previously unknown address, we initialise the peer entry with some sensible defaults: assume no background packet loss, and imagine we might have some default amount (say, 100 kilobytes/second) of bandwidth; the MTU should be estimated (for the Internet, it's probably 1500) and the RTT set to some default (10ms?), and the observed datagram rates and leaky bucket counter should start at 0.

Peers can be dropped from the cache to make space, perhaps based on their last-used timestamp being excessively long ago.

This layer has a queue of outgoing datagrams from the layer above, and a queue of outgoing control chunks. If there are outgoing control chunks that have been waiting for longer than the maximum buffering time and no outgoing datagrams destined to the address:port pair that control chunk is for, then a datagram of type 0000 is automatically fabricated in the outgoing datagram queue. The outgoing queue is ordered by delivery priority, then datagrams that are retransmissions get to go before new datagrams, and datagrams with the same delivery priority and retransmission-ness are round-robin interleaved across IDs and connection IDs to ensure fair delivery.

Outgoing datagrams pulled from the top of the queue are inspected to see if they have any space left (by subtracting their size from the estimated MTU); as many control chunks destined to the same address:port as will fit are slipped into the datagram.

To send a datagram, the leaky bucket counter is inspected. If it's more than the estimated available send bandwidth, we wait until the send bandwidth has reduced. Once the leaky bucket counter is less than the estimated available send bandwidth, we send the datagram to the underlying transport and add its size to the leaky bucket counter. Update the datagrams sent and bytes sent rates, perhaps by using exponentially decaying moving averages.

Every tick, the leaky bucket counter is reduced by the available send bandwidth per tick, but never taking it below zero.
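
That leaky bucket is simple enough to state directly (a Python sketch; "bytes per tick" stands in for the estimated available send bandwidth):

    class LeakyBucket:
        """Counts bytes sent; drains once per tick; sending waits while full."""
        def __init__(self, bytes_per_tick: int):
            self.bytes_per_tick = bytes_per_tick   # estimated send bandwidth
            self.level = 0

        def may_send(self) -> bool:
            return self.level < self.bytes_per_tick

        def on_send(self, datagram_size: int) -> None:
            self.level += datagram_size

        def on_tick(self) -> None:
            self.level = max(0, self.level - self.bytes_per_tick)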

The implementation may choose to consider some datagrams as "urgent" - certainly not ones with application data in - and send them as soon as they're in the queue without waiting for the leaky bucket counter to reduce. Their size is still added to the leaky bucket counter, though. I've not yet thought about the situations where this should happen.

Datagram types that should be acknowledged - everything apart from type 0000 (control chunks only) and types 0001 or 0110 (messages) that have a non-zero drop priority - are kept in case they need to be retransmitted. Received datagrams are inspected to find acknowledgements and rejections (acknowledgements may be explicit control chunks, or implicit in responses), and the corresponding datagrams removed from the retransmission pool. They can also expire from the retransmission pool if no response ever arrives. Expiries are reported up the stack, just like rejections. Retransmissions are sent with the same delivery priority as the original datagram, but ahead of any other queued datagrams at the same delivery priority.

Acknowledgements are carefully examined. Every acknowledgement contains a congestion notification flag and a delay time in microseconds. The presence of the congestion notification flag should be tracked in the observed datagram congestion rate. The delay time should be subtracted from the measured time between when the original datagram was sent and the acknowledgement received, and considered a sample of the round-trip time; use an exponentially decaying moving average to update the current estimate. Rejections can be assumed to have a zero delay, and their raw round-trip time also used to update the RTT estimate.
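
That RTT update is a classic exponentially decaying moving average; something like this sketch, where the smoothing factor is my choice (in the same spirit as TCP's SRTT), not the draft's:

    ALPHA = 0.125   # weight given to the newest sample

    def update_rtt_estimate(current_estimate_us: float, sent_at_us: int,
                            ack_arrived_us: int, reported_delay_us: int) -> float:
        """Subtract the receiver's reported processing delay from the raw
        round trip, then fold the sample into the running estimate."""
        sample = max((ack_arrived_us - sent_at_us) - reported_delay_us, 0)
        return (1 - ALPHA) * current_estimate_us + ALPHA * sample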

When datagrams are retransmitted, update the average datagram retransmission rate. The ratio between this and the send rate gives us a recent datagram loss rate. We need to store a bit more state in the peer structure (I've not quite figured out what yet) to help it adjust the estimated send bandwidth up and down to try and find the point at which the datagram loss rate just starts to rise due to congestion, based on the observed datagram loss rate and the observed datagram congestion-notification rate, while being aware of underlying datagram loss due to link problems and not mistaking it for congestion, so we can obtain good utilisation of noisy links.

This layer is also responsible for managing the queuing of incoming datagrams being sent up to upper layers; if the incoming datagram queue is overflowing, datagrams should be dropped (those with the highest drop priority first) and rejections with error code 00000111 sent. The queue is, of course, ordered on delivery priority. It needs to silently discard duplicates, based on the datagram ID, type, and other type-specific fields which can be combined to form a "datagram de-duplication key" (DDDK). It needs to maintain a small cache of recently received DDDKs and reject any datagrams that arrive with the same DDDK.
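
The DDDK cache only needs to be small and bounded; a minimal sketch, with FIFO eviction chosen purely for brevity:

    from collections import OrderedDict

    class DedupCache:
        def __init__(self, capacity: int = 4096):
            self.capacity = capacity
            self.seen: OrderedDict = OrderedDict()

        def is_new(self, dddk) -> bool:
            """Record the datagram de-duplication key; False means duplicate."""
            if dddk in self.seen:
                return False
            self.seen[dddk] = True
            if len(self.seen) > self.capacity:
                self.seen.popitem(last=False)   # evict the oldest key
            return True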

When datagrams are converted to an actual byte stream to send to the underlying transport, which is the responsibility of this layer, then any delay time fields in acknowledgements are filled in; this means that the datagram representation in the queue must include the receipt timestamp of the datagram it is in response to, and received datagrams must be timestamped as soon as they arrive, before being queued for upper layers to process.

Large datagrams / path MTU discovery

The previous layer gives us reliable flow-controlled datagram transport, but only for datagrams small enough to fit through the network.

This layer notices when datagrams sent by the layer above are too large for the estimated MTU of the destination peer, and converts them into fragments. It needs to watch the outgoing queue, not flooding it with fragments faster than they can be sent, but also keeping enough in there to keep the layer beneath supplied; perhaps fragment generation should be suspended whenever the queue size is above some threshold.

It also notices incoming underlying-transport errors pertaining to over-sized datagrams, and rejections due to datagrams arriving truncated, and reduces the MTU estimate to suit.

Perhaps it can also occasionally (using some extra state in the peer structure in the form of a last-MTU-update timestamp), if the MTU hasn't changed in some time, try increasing it a bit to see if it works. Sometimes the path MTU will increase, and it would be nice to not be "stuck" at a small MTU forever, harming efficiency.

And, finally, it's responsible for handling reassembly of incoming large datagrams, as described above. It should still issue the fragments as datagrams to the layer above (rather than trying to buffer a massive datagram in memory before passing anything up), but it will do all the duplicate/overlapping fragment removal and error handling as described in the protocol.

Communications between this layer and the one above need not be in the form of a queue, but direct calling - in effect, a queue with length zero.

Request/Response tracking

The next layer up can handle both sides of the request/response protocol; all layers below this have just dealt with datagrams flowing around in isolation, but here we tie requests and responses together. This is pretty trivial, so it might be worth merging it in with the next layer.

Connections

This layer handles keeping track of connections, implementing the connection open/close protocol, and attaching connection IDs to messages, requests, responses, and sub-connection opens within that connection.

ID allocation

To wrap all of the above into an API that can be used by actual applications, the only detail remaining is the ID allocator for new datagrams and connections!

Overall notes

The representation of datagrams (or application data blocks) used in the implementation should be some kind of list-of-segments rather than a block of contiguous memory. This means we can efficiently:

  1. Prepend headers to blocks of data without copying the block of data
  2. Generate fragment datagrams as references to subsequences of the original datagram, again without copying
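
A sketch of that representation (Python; a real implementation would also support slicing across segment boundaries to generate fragments without copying):

    class SegmentList:
        """A datagram as a list of buffer references; building headers and
        fragments manipulates references rather than copying payload bytes."""
        def __init__(self, segments=None):
            self.segments = list(segments or [])   # each entry: a memoryview

        def prepend(self, header: bytes) -> "SegmentList":
            return SegmentList([memoryview(header)] + self.segments)

        def total_length(self) -> int:
            return sum(len(s) for s in self.segments)

        def to_bytes(self) -> bytes:
            """The only copy, done just before handing to the transport."""
            return b"".join(bytes(s) for s in self.segments)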

The peer cache should, ideally, be shared across all users of the physical network interface so that they can share information about peers. On a UNIX system, it should be in some kind of shared memory (perhaps a small Redis instance?) so different processes can share it, if practical!

Future Extensions

There's a reason I included that four-bit protocol version field...

Multicast

Connectionless messages can map naturally to a multicast environment. If the underlying transport supports multicast addresses, then it would be possible to send a message to a multicast address (including large message datagrams having to be fragmented). As there is no scope for retransmission, then the information contained in the "large datagram begin" datagram would need to be replicated into every fragment, requiring a new datagram type code. Not all fragments might arrive, and applications would have to tolerate this.

Flow control would be a matter of being limited by the outgoing network interface, sending messages at some natural underlying rate (eg, sampling a sensor every second, or the data rate of a video stream), and maybe having some kind of graceful degradation by layering differential-quality-enhancement streams at different drop priorities.

Applications would need to be able to register to receive from a multicast address, a bit like opening a connection that they then receive messages on.

Forward error correction

If we can detect or predict high loss when sending multiple datagrams to a peer (or when sending to a multicast address), we might consider sending additional datagrams that contain error-correction information that can be used to reconstruct lost datagrams. This would probably involve collecting outgoing datagrams to the same destination into "groups" of some size, identified by a group ID, and adding some extra datagrams to the group, using a suitable forward error correction code.

If we notice a lot of CRC mismatch rejections coming back from hosts, we might also decide to start including some forward error correction data within each datagram, so that errors can be corrected!

Encrypted connections

I've kind of assumed that all the application data being thrown around will be signed and encrypted with some suitably secure technology that's beyond the responsibilities of a transport protocol, but there's one place where it would be beneficial to let encryption intrude at this level: the request/response data that gets to piggyback on the top-level connection open and connection open response could also include the first steps of a cryptographic mutual authentication and session key generation protocol, which can be carried on using connection messages and requests/responses if the connection open succeeds. Once a session key is established, then the connection's entire datagrams could be encrypted thereafter. If the connection can be identified by the source and destination address and port, then the appropriate decryption for that connection can be applied to received datagrams. To support this, the stack as presented above would require the ability for the encryption layer above it to specify a datagram-level encryption/decryption engine for a top-level connection, applied just between the flow control layer and the underlying transport, and applied to all datagrams once it has been enabled.

But this has an issue: To make the proposed connection mobility mechanism (not restricting connections to a fixed source address:port) work, we would need to identify connections by destination address:port alone - requiring a dedicated destination port per-connection, while the specification as given only requires connections to have a dedicated source port, allowing them to share the destination port.

To make that work, we could make the recipient of a top-level connection open request pick a dedicated port for the connection on their end, just as the sender of the open request picked a dedicated source port. The recipient then communicates that destination port back to the requester of the connection by sending the connection open response datagram from it, and when that datagram is received, its source address and port are used as the address to send datagrams within that connection to thereafter.

Note that this provides encryption/authentication purely for connections. Single request/response transactions are responsible for providing encrypted and/or authenticated requests and responses which, as they happen outside of a connection, will have no pre-existing security context unless one exists in some higher-level protocol (and I hope to make the connection mechanism in IRIDIUM good enough to largely make that unnecessary), so will need to be akin to PGP; fancy perfect-forward-secrecy ratchet protocols will only be possible via connection-based security protocols.

Serial transport

I have interest in communication via Unix domain sockets, stdin/stdout of subprocesses, over serial lines, and over point-to-point radio links. To do that, I'd like an underlying transport that handles point-to-point serial streams with possible byte loss. This would be a relatively simple matter of encoding datagrams using a suitable framing format that can resynchronise if bytes are lost, and either having null port identifiers or sticking a byte or two of port number in front of the datagrams if we want to multiplex things over a single link. There would be no inherent MTU of such a link, but we should enforce a configurable MTU to improve the responsiveness of multiplexed connections.

The name of the Unix domain socket, or the subprocess command, or the identifier of the serial port, would suffice as the address part for those transports.

UDPv4 VPN support

For cases where there's a pre-defined relationship between a group of hosts, it would be nice to define an encryption layer below IRIDIUM. You can do this at the operating system level with VPNs, which has the advantage of covering ALL traffic between those hosts, but is also a faff to set up and tends to be platform-specific. For IRIDIUM purposes only, the ability to set up an underlying transport on a group of hosts that joins a Tinc VPN entirely in userland could be useful?

UDPv6 transport

It would be a logical extension to support UDPv6 as well as UDPv4.

Other underlying transports

In theory, IRIDIUM could live alongside UDP and TCP as a native IP protocol. This would save us the 16 bits of the UDP checksum (which is superseded by our CRC-32), but that's not a huge deal. The downside would be that we'd need to implement IRIDIUM inside operating system network stacks and get that deployed, and likewise build support into packet-shaping routers and firewalls, which would be a tiresome and unrewarding process to save 16 bits per datagram... so, I don't think so! Running on top of UDP is fine for now!

An implementation as a raw Ethernet frame type might be a fun exercise, and potentially useful in some embedded applications, but it's not on my radar personally.

An implementation on top of LoRa or HaLow might be more interesting, though!

Conclusion

This draft is open for discussion, before I go to the work of formalising all the edge cases by converting the protocol into exhaustive state machines, and implementing it. I particularly want to find:

  1. Practical DoS attacks where third parties, knowing there is communication between two address:port pairs, can forge datagrams to disrupt that communication.
  2. Practical DoS attacks where third parties can amplify their ability to overwhelm a target with data by sending datagrams to an IRIDIUM service with forged source addresses, causing response or error datagrams that are much larger than the original datagrams to be sent there.
  3. Practical DoS attacks where, in a situation where multiple processes in different security contexts are all on the same host, a compromised process can use the shared peer flow-control state to disrupt communications for other processes (any more than they could normally, by just hogging lots of bandwidth).
  4. Situations in the protocols where one side or the other can be left waiting forever, without timing out.
  5. Any way it could be cooler, more efficient in latency or bandwidth consumption, or more adaptable to a wider range of applications.

Your comments are eagerly awaited 🙂

Configuring replication (by )

Storing all your data on one disk, or even inside one computer, is a risky thing to do. Anything stored in only one, small, physical location is all too easily destroyed by flood, fire, idiots, or deliberate action; and any one electronic device is prone to failure, as its continued functioning depends on the functioning of many tiny components that are not very easily replaced.

So it's sensible to store multiple copies, ideally in physically remote locations.

One way of doing this is by taking backups; this involves taking a copy of the data and putting it into a special storage system, such as compressed files on another disk, magnetic tape, a Ugarit vault, etc.

If the original data is lost, the backed-up data can't generally be used as-is, but has to be restored from the backup storage.

Another way is by replicating the data, which means storing multiple, equivalent, copies. Any of those copies can then be used to read the data, which is useful - there's no special restore process to get the data back, and if you have lots of requests to read the data, you can service those requests from your nearest copy of it (reducing delays and long-distance communication costs). Or you can spread the read workload across multiple copies in order to increase your total throughput.

Replication provides a better quality of service, but it has a downside; as all the copies are equally important, you can't use cheaper, slower, more compact storage methods for your extra copies, as you can with backups onto slower disks or tapes.

And then there are hybrid systems, perhaps where you have a primary copy and replicate onto slower disks as a "backup", while only using the primary copy for day-to-day use; if it fails, you switch to the slower "backup replica" and tolerate slower service until a new primary copy is made.

Traditionally, replicated storage systems such as HDFS require the administrator to specify a "replication factor", either system-wide or on a per-file basis. This is the number of replicas that must be made of the file. Two is the minimum to actually get any replication, but three is popular - if one replica is lost, then you still have two replicas to keep you going while you rebuild the missing replica, meaning you have to be unlucky and have two failures in quick succession before you're down to a single copy of anything.

However, this is a crude and nasty way of controlling replication. Needless to say, I've been considering how to configure replication of blocks within a Ugarit vault, and have designed a much fancier way.

For Ugarit replication, I want to cobble together all sorts of disks to make one large vault. I want to replicate data between disks to protect me against disk failures, and to make it possible to grow the vault by adding more disks, rather than having to transfer a single monolithic vault onto a larger disk when it gets full.

But as I'm a cheapskate, I'll be dealing with disks of varying reliability, capacity, and performance. So how do I control replication in such a complex, heterogeneous, environment?

What I've decided is to give each "shard" of the vault four configurable parameters.

The most interesting one is the "trust". This is a percentage. For a block to be considered sufficiently replicated, then copies of it must exist on enough shards that the sum of the trusts of the shards is more than or equal to 100%.

So a simple system with identical disks, where I want to replicate everything three times, can be had by giving each disk a trust of 34%; any three of them will sum to 102%, so every block will be copied three times.

But disks I trust less could be given a trust of 20%, requiring five copies if a block is stored only on such disks - or some combination of good and less-good disks.

That allows for simple homogeneous configurations, as well as complex heterogeneous ones, with a simple and intuitive configuration parameter. Nice!
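In code, the rule is a one-liner. A sketch, with shard names and trust values invented for the example:

    def sufficiently_replicated(locations, trust, target=100):
        # A block is safe once the trusts of the shards holding it
        # sum to at least the target percentage.
        return sum(trust[s] for s in locations) >= target

    trust = {"good1": 34, "good2": 34, "good3": 34, "cheap": 20}
    sufficiently_replicated({"good1", "good2", "good3"}, trust)  # True: 102%
    sufficiently_replicated({"good1", "cheap"}, trust)           # False: 54%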

The second is "write weighting". This is a dimensionless number, which defaults to 1 (it's not compulsory to specify it). Basically, when the system is given a block to store, it will pick shards at random until it has enough to meet the trust target of 100%. But the write weighting is used as a weighting when making that random choice - a shard with a write weighting of 2 will, on average, get twice as many blocks written to it as a shard with the default weighting.

So if I have two disks, one of which has 2TiB free and the other of which has 1TiB free, I can give a write weighting of 2 to the first one, and they'll fill so that they're both full at about the same time.

Of course, if I have disks that are now completely full in my vault, I can set their write weighting to 0 and they'll never be picked for writing new blocks to. They'll still be available for reading all the blocks they already have. If I left the write weighting untouched everything would still work, as the write requests failing would cause another shard to be picked for the write, but setting the weighting to 0 would speed things up by stopping the system from trying the write in the first place.
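Putting trust and write weighting together, shard selection might look like this sketch (using Python's random.choices for the weighted pick; the data layout is my invention, not Ugarit's):

    import random

    def pick_write_shards(shards, target=100):
        # shards: name -> (trust, write_weight). Pick shards at random,
        # weighted by write weight, until the chosen trusts sum to the
        # target. Zero-weighted shards are never candidates.
        candidates = {n: tw for n, tw in shards.items() if tw[1] > 0}
        chosen, trust = [], 0
        while trust < target and candidates:
            names = list(candidates)
            weights = [candidates[n][1] for n in names]
            name = random.choices(names, weights=weights)[0]
            trust += candidates.pop(name)[0]
            chosen.append(name)
        return chosen  # may fall short of the target if we ran out of shards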

The third parameter is a read priority, which is also optional and defaults to 1. When a block must be read, the list of shards it's replicated on is looked up, and a shard picked in read priority order. If there are multiple shards with the same read priority, then one is picked at random. If the read fails, we repeat the process (excluding already-tried shards), so the read priority can be used to make sure we consult a fast, nearby, cheap-to-access local disk before trying to use a remote shard, for instance.

By default, all shards have the same read priority, so read requests will be randomly spread across them, sharing the load.

Finally, we have a read weighting, which defaults to 1. When we randomly pick a shard to read from, out of a set of alternatives with the same priority, we weight the random choice with this weighting. So if we have a disk that's twice as fast as another, we can give it twice the weighting, and on a busy system it'll get twice as many reads as the other, spreading the load fairly.
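A sketch of the read side, assuming (my assumption - the post doesn't pin it down) that a lower priority number means "consult first", by analogy with IRIDIUM's delivery priorities:

    import random

    def read_order(replicas):
        # replicas: name -> (read_priority, read_weight). Produce the
        # order in which to try shards: priority bands first, with a
        # weighted random shuffle within each band.
        bands = {}
        for name, (prio, weight) in replicas.items():
            bands.setdefault(prio, []).append((name, weight))
        order = []
        for prio in sorted(bands):
            band = bands[prio]
            while band:
                names = [n for n, _ in band]
                weights = [w for _, w in band]
                pick = random.choices(names, weights=weights)[0]
                order.append(pick)
                band = [(n, w) for n, w in band if n != pick]
        return order  # try each in turn until a read succeeds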

I like this approach, since it can be dumbed down to giving defaults for everything - 34% trust (for three-way replication), and all the weightings and priorities at 1 (to spread everything evenly).

Or you can fine-tune it based on details of your available storage shards.

Or you can use extreme values for various special cases.

Got a "memcached backend" that offers fast storage, but will forget things? Give it a 0% trust and a high write weighting, so everything gets written there, but also gets properly replicated to stable storage; and give it a high read priority, so it gets checked first. Et voila, it's working as a cache.

Got 100% reliable storage shards, and just want to "stripe" them together to create a single, larger, one? Give them 100% trust, so every block is only written to one, but use read/write weightings to distribute load between them.

Got a read-only shard, perhaps due to its disk being full, or because you've explicitly copied it onto some protected read-only media (eg, optical) for security reasons? Just set the write weighting to 0, and it'll be there for reading.

Got some crazy combination of the above? Go for it!
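For concreteness, those special cases might be configured like this; the field names are mine, but the four parameters are the ones above (with priority 0 meaning "checked first", per the earlier sketch's assumption):

    shards = {
        # 0% trust: never counts towards replication, so everything also
        # lands on stable storage; checked first on reads. A cache.
        "memcached": dict(trust=0,   write_weight=5, read_priority=0, read_weight=1),
        # 100% trust: one copy suffices, so these two simply stripe.
        "stripe-a":  dict(trust=100, write_weight=1, read_priority=1, read_weight=1),
        "stripe-b":  dict(trust=100, write_weight=2, read_priority=1, read_weight=2),
        # Full or read-only disk: still readable, never written.
        "old-full":  dict(trust=34,  write_weight=0, read_priority=1, read_weight=1),
    }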

Also, systems such as HDFS let you specify the replication factor on a per-file basis, requiring more replication for more important files (increasing the number of shard failures required to totally lose them) and to make them more widely available in the cluster (increasing the total read throughput available on that file, useful for small-but-widely-required files such as configuration or reference data). We can do that too! By default, every block written needs to be replicated enough to attain 100% trust - but this could be overridden on a per-block basis. Indeed, you could store a block on every shard by setting a trust target of "infinity". Normally, when given a trust target it can't meet (even with every shard), the system would do its best and emit a warning that the data is in danger; but a trust target of "infinity" should probably suppress that warning, as it can be taken to mean "every shard".

The trust target of a block should be stored along with it, because the system needs to be able to check that blocks are still sufficiently replicated when shards are removed (or lost), and replicate them to new shards until every block has met its trust target again.
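That check is cheap to sketch, too, and math.inf makes "every shard" fall out naturally (names again mine):

    import math

    def needs_repair(locations, trust, target, alive):
        # locations: set of shards the block is on; target: the block's
        # stored trust target, possibly math.inf meaning "every shard".
        live = locations & alive
        if target == math.inf:
            return live != alive
        return sum(trust[s] for s in live) < target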

Tell me what you think. I designed this for Ugarit's replicated storage backend and WOLFRAM replicated storage in ARGON, but I think it could be a useful replication control framework in other projects, too.

The only extension I'm considering is having a write priority as well as a write weighting, just as we do with reads - because that would be a better way of enforcing all writes go to a "fast local cache" backend than just giving it a weighting of 99999999 or something, but I'm not sure it's necessary and four numbers is already a lot. What do you think?

A user interface design for a scrolling log viewer with varying levels of importance (by )

Like many people involved with computer programming and systems administration, I spend a lot of time looking at rapidly scrolling logs.

These logs tend to have lines of varying importance in them. I see two kinds: one where the lines have a "severity" (ranging from fatal errors down to debugging information), and another where there's an explicit structure, with headings and subheadings.

Both suffer from a shared problem: important events or top-level headings whoosh past amidst a stream of minutiae, and can be missed. A fatal error message can be obscured by thousands of routine notifications.

What I think might help is a tool that can be shoved in a pipe when viewing such a log, that uses some means (regexps, etc) to classify log lines with a numerical "importance", and then relays them to the output.

However, it will use terminal control sequences to:

  1. Colour the lines according to their importance
  2. Ensure that the most recent entry at each level of importance remains onscreen, unless superseded by a later entry with a higher importance.

The latter deserves some explanation.

To start with, if we just have two levels of importance - ERROR and WARNING, for instance - it means that in a stream of output, as an ERROR scrolls up the screen, when it gets to the top it will "stick" and not scroll off, even while WARNINGs scroll by beneath it.

If a new ERROR appears at the bottom of the screen, it supersedes the old one, which can now disappear - letting the new ERROR scroll up until it hits the top and sticks.

Likewise, if you have three levels - ERROR, WARNING and INFO - then the most recent ERROR and WARNING will be stuck at the top of the screen (the WARNING below the ERROR) while INFOs scroll by. If a new WARNING appears, then the old one will unstick and scroll away until the new WARNING hits the top. If a new ERROR appears, then the old ERROR and WARNING at the top will become unstuck and scroll away until the new ERROR reaches the top.

So the screen is divided into two areas; the stuck things at the top, and the scrolling area at the bottom. Messages always scroll up through the scrolling area as they come, but any message that scrolls off the top will stick in the stuck things area unless there's another message at the same or higher level further down the scrolling area. And the emergence of a message into the bottom of the scrolling area automatically unsticks any message at that, or a less important, level from the stuck area.

That way, you can quickly look at the screen and see a scrolling status display, as well as (for activity logs from servers) the most recent FATAL, ERROR, WARNING, etc. message; or for the kinds of logs generated by long-running batch jobs, which tend to have lots of headings and subheadings, you'll always instantly see the headings/subheadings in effect for the log items you're reading.
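Here's a sketch of just that bookkeeping, assuming a bigger number means more important; the terminal plumbing (scroll regions, redrawing the stuck lines, the colouring) is deliberately left out:

    def on_new_line(stuck, scroll, line, imp, scroll_rows):
        # stuck: {importance: line}, drawn at the top, most important
        # first. scroll: list of (importance, line) in the scroll area.
        # A new message unsticks anything at its own or a lesser
        # importance: a fresher line at that level is on its way up.
        for level in [l for l in stuck if l <= imp]:
            del stuck[level]
        scroll.append((imp, line))
        # A line scrolling off the top sticks, unless something at the
        # same or a higher importance is still below it in the queue.
        while len(scroll) > scroll_rows:
            old_imp, old_line = scroll.pop(0)
            if not any(i >= old_imp for i, _ in scroll):
                stuck[old_imp] = old_line

Feeding each classified line through on_new_line (starting from an empty dict and list) maintains exactly the two areas described above.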

This is related somewhat to the idea of having ERRORs and WARNINGs be situations with a beginning and an end (rather than just logged when they arise), such as "being low on disk space"; such a "situation alert" (rather than an event alert, as a single log message is) should linger on-screen somewhere until it's cancelled by the software that raised it emitting a corresponding "situation is over" event. Also related is the idea that event alerts above a certain severity should cause some kind of beeping/flashing to happen, which persists until manually stopped by pushing a button to acknowledge all current alerts. Such facilities can be integrated into the system.

This is relevant for a HYDROGEN console UI and pertinent to my previous thoughts on user interfaces for streams of events and programming interfaces to logging systems.

Thoughts on Programming and Tracing (by )

I was recently pointed at this interesting article: Learnable Programming.

It's a good read, overturning many assumptions the software industry has picked up over the years, and propagated without thought since.

The first part suggests allowing a programmer to trace the flow of execution of a program graphically, using an interactive timeline. My first thought was that this was all well and good, but would rely on every library in the language annotating every operation with information about how to present it - producing the little thumbnails to go in the timeline, or exposing numeric values that can be plotted onto charts. Also, highlighting the "current" drawing operation in red on the canvas relies on those operations being things that affect a canvas; more abstract operations, such as writing to a database (or even generating images to be encoded directly into a file rather than onto the screen) would require a more explicit "object preview".

However, those are not insurmountable obstacles. And, perhaps, they are things that can be built on top of my ideas about logging and tracing, making it possible to use such an interface to go through traces of execution captured from production servers, rather than just within a cute live-coding IDE; the trace entries generated by operations in your libraries could, with the help of a meta-library of trace visualisation rules, generate those little thumbnails. However, it would need to be augmented with dynamic scope information provided by the programming environment itself to know which line of code caused the trace event; the kind of thing one finds in a stack trace.

He asks "Another example. Most programs today manipulate abstract data structures and opaque objects, not pictures. How can we visualize the state of these programs?"; so I suggest that the abstract data structures and opaque objects be annotated with code that summarises their state. Many languages have a notion of "return a string representation of this object", generally aimed at debug logging - Python's repr() versus str(), for instance. Perhaps if we moved to expecting objects to return HTML representations of themselves, we could take a step in that direction.

The second part (and I'm taking some temporal liberties here, as some concepts I've included in the first part are touched upon in the second and vice versa) is also inspiring; it looks at the bigger picture, considering how libraries and code-editing environments can be designed to make it much easier for programmers to identify what operations their libraries are making available to them, rather than requiring the first step to be the reading of documentation. It touches on topics such as the dangers of mutable state (preaching to the converted here!), and the choice of library function names to make code using them clear (I'm also a big fan of Smalltalk / Cocoa-style function call syntax, and how it might be brought into the Lisp family of languages...)

I've written before that I think modifying software should be a much more widely-practiced activity; and I think that should be achieved through removing unnecessary obstacles, rather than forcing everyone through complicated programming classes. I'm always interested in more thoughts on how to make that happen!

Insomnia (by )

There's something about the combination of having spent many weeks in a row without more than the odd half-hour here and there to myself (time when I get to do whatever I like, rather than merely choosing which of the list of things I need to get done urgently I will do next, or just having no choice at all), and knowing I need to get up even earlier the next morning than usual (to dive straight into a long day of scheduled activities), that makes it very, very, hard for me to sleep.

So, although I got to bed in good time for somebody who has to wake up at six o'clock, I have given up lying there staring at the ceiling, and come down to eat some more food (I get the munchies past midnight), read my book without disturbing Sarah with my bedside light, and potter on my laptop. I need to be up in five hours, so hopefully emptying my brain of whirling thoughts will enable me to sleep.

There are lots of things I want to do. Even though it's something I need to get done by a deadline, I'm actually enthusiastic about continuing the project I was working on today: making an enclosure for our chickens. This is necessary for us to be able to go away from the house for more than one night, which is something we want to do over Christmas; thus the deadline.

Three of the edges of the enclosure will be built onto existing walls or woodwork, but one of them needs to cut across some ground, so I've dug a trench across said bit of ground, laid an old concrete lintel and some concrete blocks in the trench after levelling the base with ballast, and then mixed and rammed concrete around them. When I next get to work on it, I'll mix up a large batch of concrete and use it to level the surface neatly (and then ram any left-overs into remaining gaps) to just below the level of the soil, then lay a row of engineering bricks (frog down) on a mortar bed on top of that in order to make a foundation that I can screw a wooden batten to. With that done, and some battens screwed into the tops of existing walls that don't already have woodwork on, I'll be able to build the frame of the enclosure (including a door), then attach fox-proof mesh to it, and our chickens will have a new home they can run around in safely.

Thinking about how I'm going to lay the next batch of concrete in a nice level run, working around the fact that I only have a short spirit level by placing a long piece of wood in there and levelling it with wedges and then using it as a reference to level the concrete to, has been one of the things running around in my head this evening.

Another has been the next steps from last Friday, when I had a fascinating meeting with a bunch of interesting people in the information security world. You see, I've always been interested in the foundation technologies upon which we build software, such as storage management, distributed computing, parallel computing, programming languages, operating systems, standard libraries, fault tolerance, and security. I was lucky enough to find a way into the world of database development a few years ago, which (with a move to a company that produces software to run SQL queries across a cluster) has broadened to cover storage management, distribution, parallelism, AND programming languages. So imagine my delight when said company starts to develop the security features in the product, and I can get involved in that; and even more when (through old contacts) I'm invited to the inaugural meeting of a prestigious group of people interested in security. That landed me an invite to the second meeting (chaired by an actual Lord, and held in the House of Lords!), the highlight of which was of course getting to talk to the participants after the presentations. I found out about the Global Identity Foundation, who are working on standardising the kind of pseudonymous identity framework I have previously pined for; I'm going to see if I can find a way to get more involved in that. But I need to do a lot of reading-up on the organisations and people involved in this stuff, and figuring out how I can contribute to it with my time and money restrictions.

I'd really like to have some quiet time to work on my secret fiction project, too. And I want to investigate Ugarit bugs. Some bugs in the Chicken Scheme system have been found and fixed lately, so I need to re-test all these bugs to see if any of the more mysterious ones were artefacts of that. I'm in a bit of a vicious circle with that; the longer it is since I've been tinkering with the Ugarit internals, the longer it'll take me to get back into it, and the more nervous I feel about doing so. I think I might need to pick off some lighter bit of work with good rewards (adding a new feature, say) and handle that first, to get back into the swing of things. Either way, I'll need a good solid day to dig into it all again; trying to assemble that from sporadic hours just won't cut it.

I'm still mulling over issues in the design of ARGON. Right now I'm reading a book on handling updates to logical databases - adding new facts to them, and handling the conflicts when the new facts contradict older ones, in order to produce a new state of the database where the new fact is now true, but no contradictions remain. I need to work this out to settle on a final semantics for CARBON, which will be required to implement distributed storage of knowledge within TUNGSTEN. I need a semantics that can converge towards a consensus on the final state of the system, despite interruptions in internal network connectivity within the cluster causing updates to arrive in different orders in different places; doing that efficiently is, well, easier said than done.

I really want to finish rebuilding my furnace, which I hoped to get done this Summer, but I'm still assembling the structural supports for it. I've made a mould to cast shaped refractory bricks for the lining of the furnace, but I've yet to mix up the heatproof insulating material the bricks need to be made out of and start casting the bricks, as I still need to work out how I'll form the tuyere.

I want to get Ethernet cabled to my workshop, because currently I don't have a proper place for working on my laptop; I have to do it on the sofa in the lounge to be within range of the wifi, which isn't very ergonomic, doesn't give me access to my external screens, and is prone to interruption by children. I find it very motivating to be in "my space", too; the computer desk in the workshop is all set up the way I like it. And just for fun, I'd like to rig the workshop with computer-controlled sensors and gizmos (that kind of thing is a childhood dream of mine...).

This past year, I've tried booking two weekend days a month for my projects, in our shared calendar. This worked well at the start of the year, with projects such as the workshop ladder and eaves proceeding well, but it started to falter around the Summer when we got really busy with festivals and the like. I started having to fit half-days in around other things, which meant spending too much time getting started and clearing up compared to actually getting things done, so my morale faltered; and with so much other stuff on, I've been increasingly inclined to spend my free time just relaxing rather than getting anything done. On a couple of occasions I've tried taking a week off work to pursue my projects, but I then feel guilty about it and start allocating days to spending more time with the children or tidying the house, and before I know it, five days off becomes one day of actual project work. I need to stop feeling guilty about taking time to do the things I enjoy, because if I don't, I'll be too tired and miserable to do a good job of the things I should be doing! And rather than booking my monthly project days around other stuff that's going on, next year I'm going to mark out my two days each month in advance, and then move them elsewhere in the month if Sarah needs me to do something on that particular day, to decrease the chance of ending up having to scrape together half-days around the month (or to skip project days entirely, as I ended up doing last month). I feel awful about saying I'm going to spend days doing what I feel like doing rather than the things the rest of my family need me to drive them to, but if I don't, I think I'm going to fall apart!

Now... off and on I've spent forty minutes writing this blog post. So with my whirling thoughts dumped out, I'm going to go back to bed and see if I can sleep this time around. Wish me luck!
