The structure of network protocol suites

It's always struck me as odd that IP routes a packet to a transport-level protocol such as UDP, ICMP, or TCP and then lets that protocol handle routing it to an application process.

Since an application is likely to require a few different types of service for different parts of its operations, shouldn't the specification of the target application be more important than the transport mechanism?

Wouldn't it make sense to be able to send a UDP request to 'port 80' just as easily as you can send a TCP request, rather than requiring the web server to bind separate UDP and TCP listener sockets?

These chains of thought led me to design the MERCURY protocol.

The 'endpoints' of MERCURY communications are, in the ARGON world, entities. However, the protocol doesn't care about this; it just needs to know that endpoints can be identified by some opaque identifier string. The MERCURY concept could be implemented outside of ARGON by using any kind of endpoint identifier, including TCP/UDP-esque 'port numbers'.

An endpoint provides a list of interfaces, each of which is identified by some sort of global name (eg, a URI, or a CARBON name). The interface name should, ideally, be resolvable to get a specification of the interface, because within each interface is a list of methods, identified merely by numbers; the interface specification can map these into human-readable names and metadata.

Each method has a type - and it's at this point where we get the distinction into TCP/UDP/RUDP/SCTP/DCCP transport types. The method types are:

  1. Asynchronous message sink. This is somewhat like a UDP listener; a message may be sent to it, with either a request for delivery to be guaranteed, or a drop priority otherwise. There is no reply message.
  2. RPC. This is somewhat like UDP, but with a reply message coming back from the server. Delivery is always guaranteed, since the system has to wait around for the reply anyway.
  3. Connection setup. Here's where it starts to get a bit interesting and different from normal protocol suites. A connection setup is like an RPC, in that a message is sent and a reply message returned, except that it may also set up a connection; the reply message is extended to include an optional connection ID (optional since the server may refuse the connection). This is merged into a normal RPC to reduce the exposure to denial-of-service attacks; in the TCP world, one often sets up a connection (consuming server resources) and then uses that connection to attempt to authenticate, with the server rejecting the connection if you fail. With a connection setup RPC, the authentication details can be sent with the connection setup request and, if they are insufficient, the request can be rejected with no more server resources used than a normal connectionless RPC call; often just one IP packet in each direction.
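
As a rough illustration of the three method types, here is a minimal Python sketch. The names `MethodType`, `Interface`, `SetupReply` and `handle_setup` are hypothetical, and the hard-coded credential check is only a stand-in for whatever authentication a real server would do; none of this comes from a MERCURY specification.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class MethodType(Enum):
    MESSAGE_SINK = 1      # one-way message, no reply
    RPC = 2               # request followed by a reply
    CONNECTION_SETUP = 3  # RPC whose reply may carry a connection ID


@dataclass
class Interface:
    name: str  # global name, e.g. a URI
    # Methods are identified merely by numbers within the interface.
    methods: Dict[int, MethodType] = field(default_factory=dict)


@dataclass
class SetupReply:
    payload: dict
    connection_id: Optional[int] = None  # None means the server refused


def handle_setup(credentials: str, next_id: int) -> SetupReply:
    # Assumption: credentials travel with the setup request, so a
    # rejection allocates no connection state at all.
    if credentials != "secret":
        return SetupReply(payload={"error": "auth failed"})
    return SetupReply(payload={"ok": True}, connection_id=next_id)
```

The point of the sketch is that a refused setup is just an ordinary reply with no connection ID attached.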

Once a connection is established, it has its own interface on each end of the connection, defining a set of methods the other end of the connection can use. A pure client will have an empty interface, with no methods, but in general, both client and server can provide each other with message sinks and RPC methods - and even connection setup RPC methods, which create sub-connections nested within the connection.

The main effect of setting up a connection is to have a 'session' that state can be associated with, and to provide the option of ensuring ordering, because messages and RPCs sent over a session have the (per-use) option of requesting order enforcement, in which case they are assigned sequence numbers, and the recipient will ensure ordered delivery to the application. Ordered and unordered messages can be mixed freely within a connection. Also, when a connection is set up, it should be possible to attempt to reserve bandwidth for it; and, finally, a connection is maintained with heartbeats, and if either peer disappears, the survivor's application is notified promptly.
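
One way the recipient might enforce ordering while still letting unordered messages through is a small reorder buffer. This is only a sketch: the class name `OrderedReceiver`, the choice to start sequence numbers at 0, and the use of `None` to mean "no ordering requested" are all assumptions.

```python
from typing import Any, Dict, List, Optional


class OrderedReceiver:
    """Per-connection buffer that releases ordered messages in sequence."""

    def __init__(self) -> None:
        self.next_seq = 0                 # next sequence number to deliver
        self.pending: Dict[int, Any] = {}  # early arrivals, keyed by sequence

    def receive(self, seq: Optional[int], message: Any) -> List[Any]:
        """Return the messages that can now be handed to the application."""
        if seq is None:
            # Unordered message: deliver immediately.
            return [message]
        if seq < self.next_seq:
            # Already delivered: treat as a duplicate and drop it.
            return []
        self.pending[seq] = message
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready
```

For example, if message 1 arrives before message 0, it is held back until 0 turns up, while an unordered message arriving in between is delivered straight away.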

The definition of an interface defines the list of methods in that interface, and also attaches type information. Messages are structured, consisting of a list of fields, each with an identifying number; the interface definition gives names to those numbers, and specifies the type of that field. That applies to the message sent to message sinks and RPCs, and the reply messages of RPCs.
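
To make the numbered-field idea concrete, here is a small sketch of how an interface definition could be used to turn a message keyed by field numbers into one keyed by names, checking types on the way. The `FIELDS` table, its field names, and the `decode` function are invented for illustration; the wire encoding itself is not modelled.

```python
from typing import Any, Dict, Tuple

# Hypothetical interface definition: field number -> (name, type)
FIELDS: Dict[int, Tuple[str, type]] = {
    1: ("path", str),
    2: ("max_bytes", int),
}


def decode(wire: Dict[int, Any],
           fields: Dict[int, Tuple[str, type]]) -> Dict[str, Any]:
    """Map numbered fields to named fields using the interface definition."""
    named: Dict[str, Any] = {}
    for number, value in wire.items():
        if number not in fields:
            raise KeyError(f"unknown field number {number}")
        name, expected = fields[number]
        if not isinstance(value, expected):
            raise TypeError(f"field {name!r} should be {expected.__name__}")
        named[name] = value
    return named
```

Only the numbers need to travel on the wire; the names and types live in the interface definition that both ends can resolve.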

Connection setup RPCs have their request and reply message types defined, and the type of the connection defined; two interfaces have to be defined for the connection, each with their own method name->number mappings and message type declarations with their own field name->number and type mappings. Any connection setup RPCs within those interfaces will, also, have their own interface definitions, recursing as deeply as needs be.

Most applications that use TCP expend extra effort adding their own framing layer on top, to turn the byte stream into an exchange of messages with boundaries. These map naturally to the MERCURY model of delimited messages, and the MERCURY implementation can use the added information to be more efficient than TCP. However, applications that really do deal with a stream of unstructured bytes can define a message sink that accepts byte arrays.

But there's more. In the TCP/IP world, applications have to implement their own failover/load balancing mechanisms. SMTP and DNS do this; the clients fetch a list of possible servers, and try them in turn until one works.

I think the transport layer should handle this for all applications. Since ARGON is based around clustering, it should be no surprise that I've put explicit clustering support in MERCURY. A MERCURY endpoint is identified by a cluster identifier and the opaque cluster-local endpoint ID. The cluster identifier consists of a unique ID for the cluster, a version number, and a list of node addresses - IPv4, IPv6, or potentially other address families in future. Each node address is annotated with a priority and a share.

When initiating communications, the client should find the group of nodes with the highest priority in the list, and pick one by weighted-random, using the share of each node as its weighting. If the communication is anything other than a non-guaranteed-delivery message sink send, and it fails with an error or a timeout, the client should take the failed node off the list and repeat the process; if that was the last node at the highest priority, then the next priority level becomes the highest, and so on until there are no more nodes to try.
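
The selection loop described above might look something like this in Python. It is a sketch under a few assumptions: a larger `priority` number means higher priority, every `share` is a positive number, and `attempt` is a hypothetical callback that returns `True` when the communication succeeds.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Node:
    address: str
    priority: int  # assumption: larger number = higher priority
    share: int     # assumption: always positive


def pick_node(nodes: List[Node], rng: random.Random) -> Optional[Node]:
    """Weighted-random choice among the nodes at the highest priority."""
    if not nodes:
        return None
    top = max(n.priority for n in nodes)
    group = [n for n in nodes if n.priority == top]
    return rng.choices(group, weights=[n.share for n in group], k=1)[0]


def send_with_failover(nodes: List[Node],
                       attempt: Callable[[Node], bool],
                       rng: Optional[random.Random] = None) -> Optional[Node]:
    """Try nodes until one succeeds; return it, or None if all failed."""
    if rng is None:
        rng = random.Random()
    remaining = list(nodes)
    while remaining:
        node = pick_node(remaining, rng)
        if attempt(node):
            return node
        # Failed: drop the node. Once a priority level is exhausted,
        # the next level automatically becomes the highest.
        remaining.remove(node)
    return None
```

Because the highest priority is recomputed from whatever nodes remain, falling back to the next priority level needs no special handling.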

A client may, optionally, alter the weightings for nodes at the same priority by including a guess of the network distance between itself and each node, perhaps the number of bits of IPv4 address prefix in common. Some experimentation should be done to find out how significant that alteration should be.
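
As a sketch of the IPv4 prefix idea, the number of leading bits two addresses share can be computed with Python's standard `ipaddress` module. The `adjusted_weight` function and its `bonus_per_bit` parameter are purely hypothetical; how strongly distance should affect the weighting is exactly the open question mentioned above.

```python
import ipaddress


def common_prefix_bits(a: str, b: str) -> int:
    """Number of leading bits two IPv4 addresses have in common (0..32)."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    # The highest set bit of the XOR marks the first differing bit.
    return 32 - diff.bit_length()


def adjusted_weight(share: float, client: str, node: str,
                    bonus_per_bit: float = 0.05) -> float:
    """Scale a node's share upward for each bit of shared prefix.

    bonus_per_bit is an arbitrary placeholder, not a tuned value.
    """
    return share * (1 + bonus_per_bit * common_prefix_bits(client, node))
```

Setting `bonus_per_bit` to zero recovers the plain share-based weighting.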

The server-originated failure message may, optionally, have a redirect, suggesting a different node to try next. If so, that overrides the prioritised weighted choice, but if that node fails (without suggesting a redirect) the algorithm continues as usual.

On every message sent to a server, the cluster identifier version number is sent. Whenever nodes are added to or removed from the cluster, the cluster identifier version number is incremented. If a server receives a request with an outdated cluster identifier version number, it includes in its response the latest cluster identifier, with the new version number and list of nodes. The client should then update its cluster identifier, ideally across all endpoint identifiers from the same cluster.
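
The version exchange could be sketched as follows. `ClusterId`, `server_check` and `client_update` are hypothetical names, and node addresses are reduced to plain strings (no priorities or shares) to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ClusterId:
    cluster: str             # unique ID for the cluster
    version: int             # incremented when membership changes
    nodes: Tuple[str, ...]   # simplified: addresses only


def server_check(current: ClusterId,
                 client_version: int) -> Optional[ClusterId]:
    """Return the latest identifier to attach to the response, if stale."""
    return current if client_version < current.version else None


def client_update(cached: ClusterId,
                  update: Optional[ClusterId]) -> ClusterId:
    """Adopt the server's identifier only if it is for the same cluster
    and strictly newer than what the client already holds."""
    if (update is not None
            and update.cluster == cached.cluster
            and update.version > cached.version):
        return update
    return cached
```

A client holding version 3 that talks to a server at version 4 would get the new identifier back with its response and use the updated node list from then on.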

A non-ARGON version of MERCURY could use the same trick, or use DNS - explicitly handling the case of 'multiple A/SRV records returned' by going through them in order.

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales