Snell-Pym

A draft specification for IRIDIUM (by alaric)

As discussed in my previous post, I think it's lame that we use TCP for everything and think we could do much better!. Here's my concrete proposal for IRIDIUM, a protocol that I think could be a great improvement:

Underlying Transport

IRIDIUM assumes an underlying transport that can send datagrams up to some fixed size (which may not be known) to a specified address:port pair (where address and port can be anything, no particular representation is assumed); sent datagrams can be assigned a delivery priority (0-15, with 0 being the most important to deliver first) and a drop priority (0-15, with 0 being the most important to not drop), but the transport is free to ignore those. They're just hints.

You can also ask it to allocate a persistent source port (the transport gets to choose, not you), or to free one up when it's no longer needed, and to use that when sending datagrams rather than some arbitrary source port. The actual source port is not revealed by the underlying transport, nor is the source address, as it may not be known - Network Address Translation may be operating between the source and destination. All that matters is that it is consistent, and different to any other allocated source port from the same address.

The transport can also be able to be asked to listen on a specified port or address:port pair, and will return incoming datagrams that show up there. Replies to sent datagrams will also be returned, marked as either a reply to a datagram with no specific source port, or indicating which allocated source port the reply was received at (without necessarily revealing what the actual port is). Incoming datagrams may be provided with a "Congestion Notification" flag to warn that congestion is occurring within the network, and will all be marked with the source address and port they came from so that replies can be sent back.

The transport can also return errors where datagram sending failed; errors will specify the source and destination address and port of the datagram that caused the error.

(This pretty much describes UDP, when datagrams are sent with the Don't Fragment flag set and ECN enabled, and ICMP; but it could also be implemented as a raw IP protocol alongside UDP and TCP, or over serial links of various kinds, as will be described later).

The transport can also provide a "name resolution" mechanism, which converts an arbitrary string into an address:port pair.

Application-visible semantics

Given an address:port combination, you can:
1. Send a message of arbitrary size, with a drop priority (0-15) and a delivery priority (0-15). If the drop priority is 0, then the caller will be notified (eventually) of the message having arrived safely, or not. If the drop priority is non-zero, then the caller might be notified if the message is known not to arrive for any reason.
2. Send a request message of arbitrary size, with a delivery priority (0-15). The caller will be notified, eventually, of non-delivery or delivery, and if delivery is confirmed, it will at some point be given a response message, which can be of arbitrary size. The response message may arrive in fragments.
3. Request a connection, including a request message of arbitrary size, with a delivery priority (0-15). The caller will be notified, eventually, of non-delivery or delivery, and if delivery is confirmed, it will at some point be given a response message of arbitrary size and a connection context.
Set up a listener on a requested port. Given a listener, you can:
1. Be notified of incoming messages, tagged with the address:port they came from. They may arrive in fragments.
2. Be notified of incoming request messages (which may arrive in fragments), tagged with the address:port they came from, and once the request has fully arrived, be able to reply with a response message of arbitrary size.
3. Be notified of incoming connection requests (with a request message, that may arrive in fragments), tagged with the address:port they came from, and be able to reply with a response message of arbitrary size to yield a connection context, which you can immediately close if you don't want the connection.
4. Close it.
Given a connection context, you can:
1. Do any of the things you can do with an address:port combination (optionally requesting ordered delivery, as long as the drop priority of any message is 0 and the delivery priority of anything sent ordered is always consistent).
2. Do any of the things you can do with a listener (including closing it).
3. Be notified of the other end closing the connection.

Datagram format

Flag byte:
1. Top 4 bits: protocol version (0000 binary).
2. Lower 4 bits: datagram type (values defined below).
56-bit ID
Type of Service (ToS) byte:
1. Top 4 bits: drop priority.
2. Lower 4 bits: delivery priority.
Flags byte. From most significant to least significant bits:
1. Control chunks are present.
2. Ordered delivery (connection only).
3. Rest unused (must be set to 0).
If the "control chunks are present" flag is set, then we have one or more control chunks of this form:
1. Control chunk type byte:
  1. Top bit set means another control chunk follows; unset means this is the last.
  2. Bottom seven bits are the control chunk type (values defined below).
2. Control chunk contents (depending on the control chunk type):
  1. Type 0000000: Acknowledge
    1. 56-bit ID.
    2. 1 bit flag to indicate that the request being responded to arrived with the Congestion Notification flag set.
    3. 31 bits, encoding an unsigned integer: the number of microseconds between the request datagram arriving and the response datagram being sent.
  2. Type 0000001: Reject datagram
    1. 56-bit ID.
    2. Error code (8 bits).
    3. Optional error data, depending on error code:
      1. Type 00000000: CRC failure.
      2. Type 00000001: Drop priority was nonzero on a datagram other than type 0001 or 0110.
      3. Type 00000010: Spans were contiguous, overlapping, or extended beyond the datagram end in a type 1100 datagram.
      4. Type 00000011: Delivery priority differed between different ordered-delivery datagrams on the same connection.
      5. Type 00000100: Unknown datagram type code. Body is the type code in question, padded to 8 bits.
      6. Type 00000101: Unknown control chunk type. Body is the type code in question, padded to 8 bits.
      7. Type 00000110: Datagram arrived truncated at the recipient. Body is the length it was truncated at, as a 32-bit unsigned integer.
      8. Type 00000111: Datagram dropped due to overload at the recipient.
      9. Type 00001000: Temporary redirect. Body contains an address:port in the correct format for the underlying transport; caller should retry the datagram with the new destination.
      10. Type 00001001: connection ID not known for this source address:port / destination address:port combination.
      11. Type 00001010: Response received to unknown request ID.
      12. Type 00001011: Large datagram fragment is too small.
      13. Type 00001100: Large datagram begin rejected, it's too big and we don't have enough space to store it!
      14. Type 00001101: Large datagram fragment received, but we don't recognise the ID.
      15. Type 00001110: Ordered delivery flag was set on a datagram other than a connection message, request, or sub-connection open, or with a drop priority other than zero.
  3. Type 0000010: Acknowledge connection datagram
    1. 56-bit connection ID
    2. 56-bit datagram ID
    3. Acknowledgement data:
    4. Most significant bit is set to indicate that the request being responded to arrived with the Congestion Notification flag set.
    5. 31 bits, encoding an unsigned integer: the number of microseconds between the request datagram arriving and the response datagram being sent.
  4. Type 0000011: Reject connection datagram
    1. 56-bit connection ID
    2. 56-bit datagram ID
    3. Error code (8 bits).
    4. Error data, depending on error code, as per Type 0000001.
Then follows the datagram contents (depending on the datagram type), all the way up to the end of the datagram minus 32 bits:
1. Type 0000: Control chunks only - no contents.
2. Type 0001: Message: Application data.
3. Type 0010: Request: Application data.
4. Type 0011: Response:
  1. Embedded ACK:
    1. Most significant bit is set to indicate that the request being responded to arrived with the Congestion Notification flag set.
    2. 31 bits, encoding an unsigned integer: the number of microseconds between the request datagram arriving and the response datagram being sent.
  2. Application data.
5. Type 0100: Connection Open: Application data.
6. Type 0101: Connection Open Response:
  1. Embedded ACK (see above).
  2. Application data.
7. Type 0110: Connection Message:
  1. 56-bit Connection ID.
  2. Application data.
8. Type 0111: Connection Request:
  1. 56-bit Conection ID.
  2. Application data.
9. Type 1000: Connection Response:
  1. 56-bit connection ID.
  2. Embedded ACK (see above).
  3. Application data.
10. Type 1001: Sub-Connection Open:
  1. 56-bit connection ID.
  2. Application data.
11. Type 1010: Large datagram begin
  1. 4-bit inner datagram type (values marked with * are not valid)
  2. 4-bit size-of-size (unsigned integer). Eg, 0000 means a 16-bit size, 0001 means a 32-bit size, 0010 means a 64-bit size. Values above that are probably unneeded on current hardware.
  3. 16 * 2 ^ size-of-size bits of size (unsigned integer), the size of the inner datagram in bytes.
  4. Some prefix of the inner datagram, at least half the estimated maximum packet size.
12. Type 1011: Large datagram fragment
  1. 16 * 2 ^ size-of-size bits of offset (unsigned integer), the byte offset in the inner datagram that this fragment begins at.
  2. Some fragment of the inner datagram, at least half the estimated maximum packet size.
13. Type 1100: Large datagram ack
  1. 16 * 2 ^ size-of-size bits of offset (unsigned integer) of the last "Large datagram fragment" received (or all 0s if the last received fragment of this large datagram was the prefix in the "Large datagram begin")
  2. 1 bit: if set, the list of missing spans in the body of the datagram is complete. If not set, more will be sent later.
  3. 31 bits, encoding an unsigned integer: the number of microseconds between the latest fragment datagram arriving and this acknowledgement being sent.
  4. 16 bits (unsigned integer) for the number of fragments received since the last update.
  5. 16 bits (unsigned integer) for how many of those arrived with the Congestion Notification flag set.
  6. (Repeated until end of datagram contents):
    1. 16 * 2 ^ size-of-size bits of offset (unsigned integer) of a section of the large datagram that has not yet been received.
    2. 16 * 2 ^ size-of-size bits of length (unsigned integer) of that section.
    3. (The offsets must increase from start to end of the datagram, and no sections may be contiguous or overlapping).
14. Type 1101: Connection close. Contents is a 56-bit connection ID.
15. Type 1110: Large datagram cancel. No contents.
CRC32 of all of the previous bytes.

What the heck?

That's a lot to take in in one sitting, so I'm now going to go through what the IRIDIUM implementation needs to send in various situations.

Sending a small message to a host

Firstly, pick a random 56-bit ID. What you use for randomness is up to you, but it should be hard to predict in order to make spoofing tricky.

Wrap the message up in a type 0001 datagram, with your chosen delivery and drop priorities in the header, and send it off. If the drop priority was non-zero, then wait a while for a type 0000000 (acknowledge) control chunk to come back acknowledging it (the ID in the control chunk will match the ID we sent). If one doesn't come in a reasonable timeframe, send the datagram again, with the same ID.

If you get back a type 0000001 (rejection) control chunk, if the datagram was truncated, try using the large message process (and using the truncation point as an estimate of the path MTU). For any other rejection type, something probably got corrupted, so try again. Give up if you keep retrying for too long.

If the underlying transport returns a path MTU error, try using the large message process with the advised MTU as the path MTU estimate. If the underlying transport returned some other error, retry as before.

Receiving a small message

If you get a datagram with type 0001, you've received a message!

If the datagram is valid, and the ID isn't in your list of recently seen message IDs (otherwise, just ignore it), you can pass it to your application. If it has a zero drop priority, the send is expecting an acknowledgement, so craft a type 0000000 control chunk acknowledging it. You can send it in any datagram you were going to send to that address:port anyway, or create a type 0000 datagram just for it if there's no other traffic headed that way.

If the datagram is invalid, craft a type 0000001 (rejection) control chunk explaining why. Again, you can embed it in an existing datagram of create a type 0000 datagram just for it.

Sending a small request to a host

This starts off the same as sending a message: you pick an ID, and send your request in a datagram of type 0010 with the chosen delivery priority (drop priority isn't supported for requests).

As with the message case, keep retrying until you receive an acknowledgement. However, you may also directly receive a type 0011 datagram containing a response, which has an implicit acknowledgement inside it.

When you receive a type 0011 response datagram with the same ID as your request, send an acknowledgement in a type 0000000 control chunk.

Receiving a small request

If you get a datagram with type 0010, you've received a request!

If the datagram is invalid, craft a type 0000001 (rejection) control chunk explaining why and send it back.

If the datagram is valid, and the ID isn't in your list of recently seen request IDs (otherwise, just ignore it), you can pass it to your application and await a response.

If you don't get a response from the application within a reasonable timeframe, craft a type 0000000 control chunk to acknowledge the request and send it back.

When you get a response from the application, if it looks like it'll fit into a single datagram and be no larger than twice the size of the request datagram (to avoid amplification attacks), craft a type 0011 datagram with the same ID as the request and send the response back. As with sending a small message, you'll need to wait for acknowledgement, and retransmit the response if you don't get it or if you get a rejection back that looks like it was corruption (don't bother retransmitting if you get a type 00001010 error code, indicating the request ID wasn't recognised).

If your response looks too big for a datagram, or you got a rejection saying it was truncated or the underlying transport rejects it saying it was too big, you need to follow the "sending a large response" flow.

Sending a large message, request, or response

If you want to send a message, request, or response that you know won't fit in a single datagram, or you tried it and got got rejected for being too big or arriving truncated, you need to send it as a large datagram.

Form an estimate of the largest datagram you can send, using whatever information you have to hand. Take a guess.

Send a type 1010 datagram containing the details of the large datagram and as many of its initial bytes as you think you can fit. If this is a response to a single-datagram request outside of a connection, do not send an initial datagram more than twice the size of the request datagram (to mitigate amplification attacks).

If the type 1010 datagram gets acknowledged in a type 0000000 control chunk, start sending type 1011 datagrams with further sections of as many bytes as you think you can fit, in order; if it gets rejected by a type 0000001 control chunk, don't.

When the recipient starts sending back type 1100 "large datagram ack" datagrams, use that to update your knowledge of what fragments of the large datagram have been received. Any fragment that has been sent and not acknowledged within a reasonable timeframe should be retransmitted. If you never get any acknowledgements after a while, give up.

If you receive evidence that the datagrams are too large for the link, reduce the size of datagram you are sending. If you receive a rejection with error code 00001101, abort sending. If you want to give up sending, send a type 1110 large datagram cancel datagram with the same ID.

When the recipient has sent a type 1100 "large datagram ack" that is marked as complete and indicates that no fragments are missing, you can stop.

Receiving a large message, request, response, connection open, or connection open response.

If you receive a type 1010 datagram, you're getting a large datagram! You can reject it with error code 00001011 if it could easily have fitted into a normal datagram. Anyway, look at the ID and the type code and decide if you want to accept it or reject it (eg, if it's a response to an ID you don't have an oustanding request for, it's bogus). You might even examine the prefix of the body, which is included in the request, to make that decision.

If you want to reject it, send a suitable rejection control chunk. If you want to accept it, send an acknowledgement control chunk and pass the total length and prefix to the application as an incoming large datagram. Keep track of what sections of the large datagram you're missing, which will initially be everything except the prefix.

Any type 1010 or 1011 datagram that is too small can be rejected with a control chunk rejection code of 00001011. Most underlying transports have a minimum MTU, and datagrams smaller than that didn't need to be split into fragments, so anybody doing so is probably just trying to waste your resources. The exception, of course, being the final fragment of the datagram!

If you receive a type 1011 datagram with an ID that doesn't match a current large datagram receive in progress, reject it with a control chunk rejection code of 00001101.

All parts of a large datagram should come from the same source address:port and to the same destination port. If any don't, reject them with error code 00001101.

As type 1011 datagrams with the same ID flood in, pass them on to the application along with their fragment offsets, and subtract them from the list of missing sections - unless you already had that fragment, in which case ignore it. There may be an overlap, in which case, only send the new data (for instance, the sender sent two 16KiB fragments, only the second of which arrived; but the send mistakenly thought that it was sending too-large datagrams and tried again, sending 16KiB fragments - the first of which is entirely novel and can be sent to the application, the second of which overlaps the second 16KiB received originally, so the application should be sent only the first, missing, 6KiB; and the third one falls entirely inside the initially-received second 16KiB fragment so can be ignored).

At reasonable intervals, send a type 1100 datagram containing the offset of the most recently received fragment, how long since you received it, the count of fragments received since the last type 1100 datagram, how many of them had congestion warnings, and the current list of missing spans. If the whole list won't fit in a datagram, just cut it short and don't set the "this list is complete" bit; as we send the list in order, we will be omitting spans towards the end of the large datagram, and as the sender sends fragments in order, it is less likely to care about them yet anyway.

If you receive nothing for a while, give up. If you receive a type 1110 large datagram cancel datagram with the same ID, give up. If you want to give up for your own reasons, send a control chunk rejection with the ID of the large datagram and error code 00001101 indicating that you no longer recognise the ID.

When you have all of the large datagram, send a type 1100 datagram confirming that you are missing no parts (and make sure to mark it as complete information), to tell the sender they can finish.

Opening a top-level connection

Every connection should have a dedicated source port (and consistent source address), so ask the underlying transport to allocate you a consistent outgoing port and use it for all traffic pertaining to that top-level connection and all its subconnections. Any datagram pertaining to a particular connection that arrives on the wrong port should be rejected with an error code of 00001001.

To request opening a connection, pick an ID for the connection and send a type 0100 (connection open) datagram containing the initial application data accompanying the request. If it's too large for one datagram, use the large datagram process above.

Much as with a request, you'll either get back a type 1000 datagram with the response, or a control chunk of type 0000000 acknowledging it then a later type 1000 response - or maybe a control chunk of type 0000001 rejecting it. If you hear nothing within a reasonable timeframe, retry. The type 1000 datagram may be a large one, in which case you'll get it embedded inside type 1010 and 1011 datagrams, as usual. Report this back to the application, and keep a track of that connection ID.

Closing a connection

Send a type 1101 datagram with the connection ID inside it and its own unique ID. It must have the "ordered delivery" flag set if you want to ensure that any ordered datagrams sent are actually delivered, as it could arrive before them; or omit it if you just want the connection closed quickly. As usual, if you don't get an ACK, retry for a reasonable timeframe.

Answering a top-level connection request

If you receive a type 0100 datagram (possibly embedded in a large datagram), somebody wants to open a connection to you. Keep track of the source address:port, we'll need it later!

Pass the embedded request data to the application and await its response. If you don't get one soon, send an acknowledgement control chunk to stop the sender retrying. When you get a response, reply with a type 1000 datagram containing the response (again, embedding it in a large datagram if it's too big).

If the application doesn't really want the connection, it will express that somehow in its response, and immediately close the connection afterwards.

Sending unordered messages or requests responses, or connection closes within a connection

These follow exactly the same processes as above, except that we use different datagram types: type 0110 for a message, type 0111 for a request, and type 1000 for a response. We also use type 0000010 control chunks to acknowledge datagrams and type 0000011 to reject; all of these are the same as their normal types except they include an additional 56-bit connection ID.

Receiving unordered messages, requests, responses, or connection closes within a connection

These follow the same processes as above, except we use the different datagram types, and find the connection ID inside the extended datagram formats in order to know which application-level connection to report the datagram to. We also check that the incoming source address and port matches what we expect for the connection ID.

Opening sub-connections within a connection

Again, we follow the same process as opening a top-level connection, except we use a datagram of type 1001, including the parent connection ID. We use the normal connection accept/reject datagram/control chunk types for the rest of the protocol; the link between the connection and its parent is established only during the open process.

Sub-connections use the same source port as the top-level connection.

Answering a sub-connection request

This is just like opening a top-level connection request, but with a type 1001 datagram initiating the process, and indicating a parent connection ID which we use to tell the application what connection this is a sub-connection of.

Sending ordered messages, requests, or sub-connection opens within a connection

Ordered messages must have a drop priority of zero, and the delivery priority of all ordered datagrams must be consistent for the lifetime of a connection - whatever delivery priority we first use, all subsequent ones must be the same. You will receive a rejection with an error code of 00000011 for changing the delivery priority during a connection, or 00001110 for violating the static invariants.

To send ordered datagrams, follow the usual process for that type of datagram, but set the "ordered delivery" flag - and ensure that the ID of the datagram is one larger than the ID of the previous ordered datagram within that connection (with wraparound from 2^56-1 to 0). To keep that ID "available", the process of randomly selecting IDs for datagrams within a connection should avoid ones within some safety margin above the current last-ordered-datagram-ID. Of course, it's fine to re-use datagram IDs "after a while" (see below), and 56 bits is a large space for randomly-chosen values to not collide in, so this should be easy to ensure.

Note that the sequentially increasing ID constraint is independent for each direction of the connection.

Notes

Check source addresses!

As I have mentioned at various points above, but not exhaustively, we can be very picky that the source address and port of a datagram matches what we expect:

A response should come from the address:port we sent the request to.
Any control chunks pertaining to a datagram ID should come from the address:port we sent that datagram to originally.
All fragments, cancel datagrams, and control frames pertaining to a large datagram should come from the same address:port.
All datagrams pertaining to a top-level connection and all its subconnections should come from the same address:port.

This is a measure to make it harder to spoof datagrams to interfere with others' communications - but if a 56-bit random ID is sufficient for that alone, we could relax all those restrictions. This would have the following benefits:

Less storing of expected source ports in protocol state, saving memory.
Less checking of source ports against expected values, saving time.
Most importantly, if a mobile device changes IP address due to switching networks, its in-progress communications will succeed. Responses/acks/rejections/etc send to its old address will be lost into the void, but that will be handled like any other lost packet - the mobile device will retry sending things from its new IP, and the recipient will receive them and subsequent replies to those new datagrams will be routed to its new home.

There is one problem, though: the state for a connection, or an in-progress large datagram send in either direction, MUST include the address:port of the other end, so that new datagrams can be sent there. That is set when the connection or large-datagram starts, but at what point do we update it if we start receiving datagrams for that connection from a different address:port?

If an attacker guesses the connection ID, they could send a single datagram using that connection ID to "hijack" the connection, by having their source address:port set as the new far end of the connection so they receive all subsequent traffic; or an attacker could send a request that triggers a large datagram response and then spoof a packet with the same request ID from a victim host to bombard them with large-datagram fragment traffic.

The cryptographic layer I propose adding on top would fix that (see below) for connections, but more careful thought is required before enabling this. Perhaps when a change of source address:port is detected, communication on that connection/large datagram send should be paused and a probe datagram of some kind sent, containing a random value which must be returned (so a spoofing attacker can't fake the acknowledgement), before traffic resumes?

Note that this needs to happen for both ends of a large datagram transmission - sender or recipient could migrate. And, of course, once opened, connections are symmetrical - so with time, both ends of a connection could migrate to different addresses and ports, multiple times!

Picking IDs

Every datagram has an ID. They are not necessarily unique - a request and a response will have the same ID to tie them together, and the start, fragments, acknowledgement, and cancellation of a large datagram will all have the same ID. In fact, if a request is large and the response is large, then the same ID will be used for the process of sending the request in fragments and then for the process of sending the response back in fragments; these two processes can't overlap in time, so there is no ambiguity.

However, other than those allowed forms of re-use of the same ID because they pertain to related datagrams, IDs must be unique with the context of (for non-connection traffic) the source address:port and destination port, and (for connection traffic) the source address:port, destination port, and connection ID. Receivers will keep track of the IDs which further datagrams are expected for in any given context (eg, the response to a request that was sent out, or an in-progress large datagram transfer) so that incoming datagrams can be routed to the appropriate process, and they will also keep track of "recently used" IDs when those IDs are no longer expected again, so that any duplicated datagrams can be quietly rejected.

The "recently used" IDs should be kept for a reasonable time period - long enough for any lurking datagrams trapped in the network to have drained out. It is suggested that the process of picking a new ID be random (with the exception of ordered datagrams in a connection, although the initial ordered datagram ID in each direction in a connection should be random). This makes it harder for third parties in the network to forge rejection or close datagrams and mess with our communications. A sender could, for some extra robustness, check that a randomly allocated ID is not present in the list of current or recently used IDs in that context, as well as ensuring that a new ID is not too close to the sequential ID counter of the enclosing connection (if any) to risk a possible collision.

Analysis notes

Control chunks are never retransmitted; no record is made of which control chunks were carried with a datagram, so if a datagram is retransmitted, then it will just carry whatever control chunks were waiting at the time of the retransmit. Type 0000 datagrams are never acknowledged and hence never get retransmitted, nor are type 0001 or type 0110 (message and connection message, respectively) datagrams that have a nonzero drop priority, but all others are (although large datagram fragments have their own custom retransmit mechanism rather than per-datagram acknowledgements).

TODO

To make that a proper protocol specification, I need to:

Write each process as a proper state machine.
Define the exact response to every different kind of error code.
Clarify all the validity checks, and how to respond to them with an error, and what to do next.

A few improvements I've already thought of are:

Having duplicate datagram types for connection or connectionless versions of things, differing only in having a connection ID added, is perhaps over-complicated. Consider using one of those unused flag byte bits to say "Is it a connection datagram?", and if so, add a connection ID - only on datagram types that make sense inside a connection.
When a request has been sent and the application at the receiving end is taking a while to respond, an acknowledgement control chunk is sent back and then the caller waits forever for a reply. If the server disappears, the caller has no way of knowing. Also, the caller has no way of requesting that the server abort if the caller realises it doesn't need the response. The server end should be made to send additional acknowledgement control chunks at regular intervals to "keep-alive" the request, so the client can retry or give up if they stop appearing. Also, define a chunk type to cancel a request, sent with the same ID as the request.
The same applies for an open connection - if you've not sent anything on that connection within some reasonable timeframe, send a "keep-alive". If the far end hasn't sent you one within a reasonable timeframe, consider the connection closed by a failure (send the application an error rather than a normal close).
Make the acknowledgement control chunk include the datagram type that it's acknowledging as well as the ID, so datagram senders can differentiate acknowledgement of a request datagram and acknowledgement of the cancellation of that request, the latter so it can stop retransmitting the cancel.
Writing all those type codes in binary and then referring to them in the text by their binary numbers sucks. Use hexadecimal, and refer to the type names in the text rather than the numbers.
I should clarify the existing of a "transaction", that being any operation which requires some synchronisation between two peers. Sending a single-datagram message and awaiting an ACK is a transaction; sending a request and awaiting a response and ACKing the response is a transaction; sending a large message in fragments is a transaction; any connection is a transaction (wich sub-transactions for any of the above happening over that connection, or sub-connections), etc. Making these clearer makes it easier to analyse the protocol. The sections on sending/receiving each side of the transaction should probably be merged together to clarify this.
I have boldly spoken of underlying transport errors being detected in various cases in the spec, but of course we don't know exactly what datagram ID an error is returned by - just the source and destination address and port. This probably isn't a problem as the errors received generally apply to the host as a whole being unreachable or requiring smaller datagrams or the destination port not listening at all, which are broad-scope things that apply to all transactions in progress to that address or address:port pair, but the spec needs to clarify this.
To avoid connectionless responses to be used as an amplification attack (send a small datagram to an innocent server requesting a large response, but with a spoofed source address pointing at your victim, to flood them with unwanted traffic), I've mandated the the response datagram must be no bigger than twice the size of the request or it goes into the large-datagram mode, which sends an initial datagram (again, no larger than twice the size of the original) then awaits an ACK before sending the rest. Might it be worth including optional padding (which must be zeroed) in a connectionless response, so that senders can increase the allowance for single-datagram responses, in effect performing a "bandwidth proof-of-work" by using some of their outgoing bandwidth? I'll need to do the maths on the cost of the padding vs. the cost of extra round-trips and headers to send a small response as a multipart datagram.
I need to go through every datagram sent in the protocol and check that:
1. The effect of it going missing isn't catastrophic, eg something is retransmitted until the implementation gives up trying.
2. The effect of it being duplicated isn't catastrophic, eg the deduplication logic can make sense of it.
Should I include any useful functionality at the lower layers, that the normal IP layer fails to provide for us in a helpful manner? The section about source address validation already opens the question of embedding a connection mobility mechanism. The protocol is already designed so that an implementation connecting to a well-known public address and port should get bidirectional communication even if they are behind NAT, by carefully not caring about the source address of datagrams other than to route replies back; a mobility mechanism will make this more robust by allowing recovery from loss of NAT state in the network, treating it as a migration of the connection to a new source address:port. But should we embed STUN support?
To prevent large datagram responses being used as an amplification attack, I need to include a random value in the large datagram begin which must be returned in a special acknowledgement control chunk. Otherwise, an attacker can send a spoofed request followed by a spoofed large datagram begin acknowledgement control chunk, from the victim address.

How to implement this

The implementation of the above splits neatly into layers, which should make implementations easy to reason about. Of course the layers might get all smushed together in practice for performance reasons or something, but in theory at least, they can be isolated relatively cleanly.

UDPv4 underlying transport

The underlying transport semantics have a fairly obvious mapping to UDPv4. The UDPv4 transport needs to request ECN on outgoing packets by default, but it also needs to maintain some per-peer state (using the peer cache defined in the next layer) to detect non-ECN-capable transports and stop requesting ECN where it seems to be causing problems. See section 4 of RFC6679.

For name resolution, we should allow the caller to specify a service name and default port as well as the name. Then if we are passed a DNS hostname, we can attempt to look up the service via SRV records, or if that fails, look up an A record and use the default port. Passing in a raw IP address and skipping resolution (using the default port, or a specified port for IP:PORT strings), or a domain name and a port and skipping SRV lookup, should all be supported. This should work out of the box with mDNS .local hostname resolution if the underlying resolver supports it, of course, and DNS-SD is explicitly out of scope.

Flow control, retransmission, datagram encoding

The lowest level of the stack is the flow control engine, whose job is to send datagrams to and from the underlying transport, handling retransmission and flow control.

The packet formats and protocol above don't define how flow control happens, but they do provide the tools to do it; the actual flow control algorithm is up to the implementation, and may change without changing the protocol specification.

The implementation maintains a cache of known peers, based on the address only (not port). For each peer, we store its address, an estimated background packet loss rate, an estimated available bandwidth to that peer in bytes/tick (the tick is some arbitrary unit of time, perhaps 100ms), a leaky bucket counter in bytes, an estimated maximum datagram size (known as MTU), an estimated round-trip time (RTT), an observed datagram send rate in packets/second, an observed datagram retransmission rate in datagrams/second, an observed datagram congestion-notification rate in datagrams/second, an observed outgoing bandwidth usage in bytes/second, and a last-used timestamp. The peer cache also contains some data for use by the underlying transport for that peer (identified by the address type, if multiple underlying transports are in use), the format of which is opaque to this layer and is merely provided to the underlying transport on every operation.

I've hand-wavingly referred to "reasonable" timeframes for giving up on receiving a datagram or control chunk, in the protocol specification, and the estimated RTT to a peer is used to to tune this, plus some allowance for processing time at the destination - which also places an upper bound on how long an implementation can buffer control chunks for before sending them, and how long to wait for a response from the application before sending an explicit acknowledgement control chunk. I need to pick a reasonable processing timeframe and write it in the specification, as both sender and recipient need to agree on it (but allowing it to be negotiated in the protocol itself might allow for DoS attacks, by setting it unreasonably low to put a burden on the other end to respond quickly, or unreasonably high to disallow the peer from giving up quickly and throwing away state about things you never respond to).

When a request comes in to send to a previously unknown address, we initialise the peer entry with some sensible defaults: assume no background packet loss, and imagine we might have some default amount (say, a 100 kilobytes/second) of bandwidth; the MTU should be estimated (for the Internet, it's probably 1500) and the RTT set to some default (10ms?), and the observed datagram rates and leaky bucket counter should start at 0.

Peers can be dropped from the cache to make space, perhaps based on their last-used timestamp being excessively long ago.

This layer has a queue of outgoing datagrams from the layer above, and a queue of outgoing control chunks. If there are outgoing control chunks that have been waiting for longer than the maximum buffering time and no outgoing datagrams destined to the address:port pair that control chunk is for, then a datagram of type 0000 is automatically fabricated in the outgoing datagram queue. The outgoing queue is ordered by delivery priority, then datagrams that are retransmissions get to go before new datagrams, and datagrams with the same delivery priority and retransmission-ness are round-robin interleaved across IDs and connection IDs to ensure fair delivery.

Outgoing datagrams pulled from the top of the queue are inspected to see if they have any space left (by subtracting their size from the estimated MTU); as many control chunks destined to the same address:port as will fit are slipped into the datagram.

To send a datagram, the leaky bucket counter is inspected. If it's more than the estimated available send bandwidth, we wait until the send bandwidth has reduced. Once the leaky bucket counter is less than the estimated available send bandwidth, we send the datagram to the underlying transport and add its size to the leaky bucket counter. Update the datagrams sent and bytes sent rates, perhaps by using exponentially decaying moving averages.

Every tick, the leaky bucket counter is reduced by the available send bandwidth per tick, but never taking it below zero.

The implementation may choose to consider some datagrams as "urgent" - certainly not ones with application data in - and send them as soon as they're in the queue without waiting for the leaky bucket counter to reduce. Their size is still added to the leaky bucket counter, though. I've not yet thought about the situations where this should happen.

Datagram types that should be acknowledged - everything apart from type 0000 (control chunks only) and types 0001 or 0110 (messages) that have a non-zero drop priority - are kept in case they need to be retransmitted. Received datagrams are inspected to find acknowledgements and rejections (acknowledgements may be explicit control chunks, or implicit in responses), and the corresponding datagrams removed from the retransmission pool. They can also expire from the retransmission pool if no response ever arrives. Expiries are reported up the stack, just like rejections. Retransmissions are sent with the same delivery priority as the original datagram, but ahead of any other queued datagrams at the same delivery priority.

Acknowledgements are carefully examined. Every acknowledgement contains a congestion notification flag and a delay time in microseconds. The presence of the congestion notification flag should be tracked in the observed datagram congestion rate. The delay time should be subtracted from the measured time between when the original datagram was sent and the acknowledgement received, and considered a sample of the round-trip time; use an exponentially decaying moving average to update the current estimate. Rejections can be assumed to have a zero delay, and their raw round-trip time also used to update the RTT estimate.

When datagrams are retransmitted, update the average datagram retransmission rate. The ratio between this and the send rate gives us a recent datagram loss rate. We need to store a bit more state in the peer structure (I've not quite figured out what yet) to help it adjust the estimated send bandwidth up and down to try and find the point at which the datagram loss rate just starts to rise due to congestion, based on the observed datagram loss rate and the observed datagram congestion-notification rate, while being aware of underlying datagram loss due to link problems and not mistaking it for congestion, so we can obtain good utilisation of noisy links.

This layer is also responsible for managing the queuing of incoming datagrams being sent up to upper layers; if the incoming datagram queue is overflowing, datagrams should be dropped (those with the highest drop priority first) and rejections with error code 00000111 sent. The queue is, of course, ordered on delivery priority. It needs to silently discard duplicates, based on the datagram ID, type, and other type-specific fields which can be combined to form a "datagram de-duplication key" (DDDK). It needs to maintain a small cache of recently received DDDKs and reject any datagrams that arrive with the same DDDK.

When datagrams are converted to an actual byte stream to send to the underlying transport, which is the responsiblity of this layer, then any delay time fields in acknowledgements are filled in; this means that the datagram representation in the queue must include the receipt timestamp of the datagram it is in response to, and received datagrams must be timestamps as soon as they arrive, before being queued for upper layers to process.

Large datagrams / path MTU discovery

The previous layer gives us reliable flow-controlled datagram transport, but only for datagrams small enough to fit through the network.

This layer notices when datagrams sent by the layer above are too large for the estimated MTU of the destination peer, and converts them into fragments. It needs to watch the outgoing queue, not flooding it with fragments faster than they can be sent, but also keeping enough in there to keep the layer beneath supplied; perhaps fragment generation should be suspended whenever the queue size is above some threshold.

It also notices incoming underlying-transport errors pertaining to over-sized datagrams, and rejections due to datagrams arriving truncated, and reduces the MTU estimate to suit.

Perhaps it can also occasionally (using some extra state in the peer structure in the form of a last-MTU-update timestamp), if the MTU hasn't changed in some time, try increasing it a bit to see if it works. Sometimes the path MTU will increase, and it would be nice to not be "stuck" at a small MTU forever, harming efficiency.

And, finally, it's responsible for handling reassembly of incoming large datagrams, as described above. It should still issue the fragments as datagrams to the layer above (rather than trying to buffer a massive datagram in memory before passing anything up), but it will do all the duplicate/overlapping fragment removal and error handling as described in the protocol.

Communications between this layer and the one above need not be in the form of a queue, but direct calling - in effect, a queue with length zero.

Request/Response tracking

The next layer up can handle both sides of the request/response protocol; all layers below this have just dealt with datagrams flowing around in isolation, but here we tie requests and responses together. This is pretty trivial, so it might be worth merging it in with the next layer.

Connections

This layer handles keeping track of connections, implementing the connection open/close protocol, and attaching connection IDs to messages, requests, responses, and sub-connection opens within that connection.

ID allocation

To wrap all of the above into an API that can be used by actual applications, the only detail remaining is the ID allocator for new datagrams and connections!

Overall notes

The representation of datagrams (or application data blocks) used in the implementation should be some kind of list-of-segments rather than a block of contiguous memory. This means we can efficiently:

Prepend headers to blocks of data without copying the block of data
Generate fragment datagrams as references to subsequences of the original datagram, again without copying

The peer cache should, ideally, be shared across all users of the physical network interface so that they can share information about peers. On a UNIX system, it should be in some kind of shared memory (perhaps a small Redis instance?) so different processes can share it, if practical!

Future Extensions

There's a reason I included that four-bit protocol version field...

Multicast

Connectionless messages can map naturally to a multicast environment. If the underlying transport supports multicast addresses, then it would be possible to send a message to a multicast address (including large message datagrams having to be fragmented). As there is no scope for retransmission, then the information contained in the "large datagram begin" datagram would need to be replicated into every fragment, requiring a new datagram type code. Not all fragments might arrive, and applications would have to tolerate this.

Flow control would be a matter of being limited by the outgoing network interface, and either sending messages at some underlying rate (eg, sampling a sensor every second, or the data rate of a video stream), and maybe having some kind of degradation due to layering differential-quality-enhancement streams at different drop priorities.

Applications would need to be able to register to receive from a multicast address, a bit like opening a connection that they then receive messages on.

Forward error correction

If we can detect or predict high loss when sending multiple datagrams to a peer (or when sending to a multicast address), we might consider sending additional datagrams that contain error-correction information that can be used to reconstruct lost datagrams. This would probably involve collecting outgoing datagrams to the same destination into "groups" of some size, identified by a group ID, and adding some extra datagrams to the group, using a suitable forward error correction code.

If we notice a lot of CRC mismatch rejections coming back from hosts, we might also decide to start including some forward error correction data within each datagram, so that errors can be corrected!

Encrypted connections

I've kind of assumed that all the application data being thrown around will be signed and encrypted with some suitably secure technology that's beyond the responsibilities of a transport protocol, but there's one place where it would be beneficial to let encryption intrude at this level: the request/response data that gets to piggyback on the top-level connection open and connection open response could also include the first steps of a cryptographic mutual authentication and session key generation protocol, which can be carried on using connection messages and requests/responses if the connection open succeeds. Once a session key is established, then the connection's entire datagrams could be encrypted thereafter. If the connection can be identified by the source and destination address and port, then the appropriate decryption for that connection can be applied to received datagrams. To support this, the stack as presented above would require the ability for the encryption layer above it to specify a datagram-level encryption/decryption engine for a top-level connection, applied just between the flow control layer and the underlying transport, and applied to all datagrams once it has been enabled.

But this has an issue: To make the proposed connection mobility mechanism (not restricting connections to a fixed source address:port) work, we would need to identify connections by destination address:port alone - requiring a dedicated destination port per-connection, while the specification as given only requires connections to have a dedicated source port, allowing them to share the destination port.

To make that work, we could make the recipient of a top-level connection open request pick a dedicated port for the connection on their end, just as the sender of the open request picked a dedicated source port. The recipient then communicates that destination port back to the requester of the connection by ending the connection open response datagram from it, and when that datagram is received, its source address and port are used as the address to send datagrams within that connection to thereafter.

Note that this provides encryption/authentication purely for connections. Single request/response transactions are responsible for providing encrypted and/or authenticated requests and responses which, as they happen outside of a connection, will have no pre-existing security context unless one exists in some higher-level protocol (and I hope to make the connection mechanism in IRIDIUM good enough to largely make that unnecessary), so will need to be akin to PGP; fancy perfect-forward-secrecy ratchet protocols will only be possible via connection-based security protocols.

Serial transport

I have interest in communication via Unix domain sockets, stdin/stdout of subprocesses, over serial lines, and over point-to-point radio links. To do that, I'd like an underlying transport that handles point-to-point serial streams with possible byte loss. This would be a relatively simple matter of encoding datagrams using a suitable framing format that can resynchronise if bytes are lost, and either having null port identifiers or sticking a byte or two of port number in front of the datagrams if we want to multiplex things over a single link. There would be no inherent MTU of such a link, but we should enforce a configurable MTU to improve the responsiveness of multiplexed connections.

The name of the Unix domain socket, or the subprocess command, or the identifier of the serial port, would suffice as the address part for those transports.

UDPv4 VPN support

For cases where there's a pre-defined relationship between a group of hosts, it would be nice to define an encryption layer below IRIDIUM. You can do this at the operating system level with VPNs, which has the advantage of covering ALL traffic between those hosts, but is also a faff to set up and tends to be platform-specific. For IRIDIUM purposes only, the ability to set up an underlying transport on a group of hosts that joins a Tinc VPN entirely in userland could be useful?

UDPv6 transport

It would be a logical extension to support UDPv6 as well as UDPv4.

Other underlying transports

In theory, IRIDIUM could live alongside UDP and TCP as a native IP protocol. This would save us the 16 bits of the UDP checksum (which is superceded by our CRC-32), but that's not a huge deal. The downside would be that we'd need to implement IRIDIUM inside operating system network stacks and get that deployed, and likewise build support into packet shaping routers and firewalls, which would be a tiresome and unrewarding process to save 16 bits per datagram... so, I don't think so! Running on top of UDP is fine for now!

An implementation as a raw Ethernet frame type might be a fun exercise, and potentially useful in some embedded applications, but it's not on my radar personally.

An implementation on top of LoRa or HaLow might be more interesting, though!

Conclusion

This draft is open for discussion, before I go to the work of formalising all the edge cases by converting the protocol into exhaustive state machines, and implementing it. I particularly want to find:

Practical DoS attacks where third parties, knowing there is communication between two address:port pairs, can forge datagrams to disrupt that communication.
Practical DoS attacks where third parties can amplify their ability to overwhelm a target with data by sending datagrams to an IRIDIUM service with forged source addresses, causing response or error datagrams that are much larger than the original datagrams to be sent there.
Practical DoS attacks where, in a situation where multiple processes in different security contexts are all on the same host, a compromised process can use the shared peer flow-control state to disrupt communications for other processes (any more than they could normally, by just hogging lots of bandwidth).
Situations in the protocols where one side or the other can be left waiting forever, without timing out.
Any way it could be cooler, more efficient in latency or bandwidth consumption, or more adaptable to a wider range of applications.

Sarah and Alaric Snell-Pym living in interesting times