TCP sucks

These days, with the exception of real-time streaming media and a few specialist apps, the go-to transport-layer protocol for building Internet applications is TCP.

The problem is, TCP isn't really all that great for the things we're using it for.

I won't go into any more detail on how it works than is necessary to make my point, but let's take a look at how TCP is used by an application.

Somewhere, a program wants to provide a service via TCP, so it tells the network stack it would like to listen on a chosen TCP port (e.g., port 80 or 443 for a Web server).

Once that has happened, a client somewhere can attempt to connect to it, by specifying the address of the computer running the server and the port number. There is a short delay while the "three-way handshake" occurs - the client sends a request to the server (called SYN), the server responds with an offer (called SYN/ACK) and the client replies to accept that offer (called ACK). The time taken for the connection to be usable by the client is the time taken for the SYN to get to the server and the SYN/ACK to return, known as the round-trip time of the connection.

After this, the network stack on the server notifies the server program that there's a new connection, and the network stack on the client notifies the client program that the connection is ready - if all goes well.
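
In code, that whole dance is only a few lines. Here's a minimal sketch using Python's standard socket API; the port and addresses are arbitrary, and the two halves would run as separate programs:

    # Server half: ask the stack to listen, then wait for connections.
    import socket

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("0.0.0.0", 8080))    # claim a port, as a Web server claims 80 or 443
    srv.listen()
    conn, addr = srv.accept()      # returns once a client's handshake completes
    conn.sendall(conn.recv(1024))  # echo one chunk of the byte stream back
    conn.close()

    # Client half: connect() blocks for the SYN / SYN-ACK round trip.
    cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    cli.connect(("127.0.0.1", 8080))
    cli.sendall(b"hello")
    print(cli.recv(1024))
    cli.close()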

If it does, both ends can send each other a stream of bytes. TCP takes the underlying network - which is packet-based, meaning that small chunks of data are sent as "messages" - and creates the illusion of something like a serial cable, or a UNIX pipe between processes: just a stream of bytes. Behind the scenes it handles detecting lost packets and re-transmitting them, ensuring that packets that arrive out-of-order are passed on in the correct order rather than randomly rearranging parts of the byte stream, and "flow control": detecting when the traffic being sent is overwhelming some bottleneck in the network between client and server, and ramping down the rate at which it's sent so as not to waste resources. (Strictly speaking, that last job is congestion control, but I'll stick with the everyday term.)

TCP's poor flow control

It performs flow control by detecting congestion in the only way really possible across the Internet: looking at the rate at which the packets it sends actually arrive at the recipient. Some packets get lost due to equipment faults or noise on communications links, which is a sort of background "fatality rate" of packets in-flight, regardless of how much data is being sent. But when a router somewhere in the network is receiving packets faster than it can send them down some link, they queue up, and if the queue gets too full, it starts dropping packets rather than keeping them in memory for too long. So the rate of packet loss on a link will stay at the background "fatality rate" as the rate at which data is being sent rises, until we hit the capacity of the weakest link in the system, at which point the packet loss rate will rise - after a short delay as the queues fill up to the point where they start dropping things.

TCP uses a relatively simple trick: when a packet it sent is lost (which it knows about, because the other end fails to acknowledge receiving it), it cuts the sending rate sharply - classically, halving it. But when traffic is getting through without a problem, it increases the sending rate a little. This means that as long as the basic rate of background packet loss isn't too high, it will slowly increase the rate of sending until it overwhelms the slowest link; then, once it's noticed (which takes a while: the queue has to overload, something has to be dropped, the loss has to be noticed by the receiver, and this fact communicated back to the sender), it will decrease the sending rate - so the rate of sending hovers around the maximum, pushing slightly past it until it sees some packet loss, then backing off again. (There's also a technique called Random Early Detection (RED) to try to signal link overloads before they become critical and improve the reaction time, and a mechanism called Explicit Congestion Notification (ECN) that also helps, but ECN isn't widely used at the time of writing.)

If you watch a bandwidth consumption graph when doing a large file transfer, you can sometimes see small ripples in the graph as this happens.
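
That additive-increase, multiplicative-decrease cycle is easy to caricature in a few lines of Python; every number here is invented purely for illustration:

    # Toy AIMD loop: creep the rate up, halve it whenever loss is seen.
    import random

    CAPACITY = 100.0   # hypothetical bottleneck link, in packets/sec
    rate = 10.0

    for tick in range(60):
        overloaded = rate > CAPACITY            # queues overflow past capacity
        background = random.random() < 0.001    # the rare "fatality rate" loss
        if overloaded or background:
            rate /= 2       # multiplicative decrease on any loss signal
        else:
            rate += 1.0     # additive increase while all seems well
    # plotted over time, `rate` traces exactly the ripples described above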

But there are a few problems with this.

Firstly, when a new transfer starts, the TCP stack has no idea how much bandwidth will be available; it has to take a guess and then rapidly adjust up or down to the actual bandwidth. In practice, it starts low and then rapidly increases, a process known as Slow Start. However, because it's ramping the rate up quickly to find the point where loss kicks in, any background loss can fool it into backing off prematurely, so the connection starts slow and then only gradually cranks up the speed; and a short-lived connection (or one with short bursts of traffic) might never find the top speed before it finishes.

Secondly, this process happens independently for every TCP connection. If you have two connections from one computer to another, both will be increasing and decreasing their sending rate independently - either of them can saturate the bottleneck link, but the packet loss caused by this might be felt by both connections. This means that if one connection is sending too fast, it might be the other one that gets throttled back because it lost a packet. This can happen when multiple connections from and to different computers happen to come together at a single bottleneck in the network, too; and in the common case where a link is overloaded and all connections through it suffer some packet loss, and they all scale back at once, they collectively over-react because they can't coordinate with each other - leaving the link under-utilised.

TCP's over-eager in-order delivery

TCP delivers bytes in the order they were sent by marking each packet of bytes with its position in the stream. If the recipient receives bytes 0-100, then bytes 201-300, then bytes 101-200, it will deliver bytes 0-100 directly to the application; sit on bytes 201-300 as they are not the next bytes; then deliver bytes 101-200 as soon as they arrive, followed by the saved bytes 201-300. So the application receives every byte in order.
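
A sketch of what the receiver is doing - simplified, since real TCP numbers every byte and copes with sequence-number wraparound:

    # Hold out-of-order segments until the gap before them is filled.
    expected = 0   # offset of the next byte we may hand to the application
    held = {}      # early segments, keyed by their starting offset

    def on_segment(start, data, deliver):
        global expected
        if start != expected:
            held[start] = data       # e.g. bytes 201-300: sit on them
            return
        deliver(data)                # e.g. bytes 0-100: pass straight through
        expected += len(data)
        while expected in held:      # a filled gap may release saved segments
            segment = held.pop(expected)
            deliver(segment)
            expected += len(segment)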

The problem is, most TCP applications fall into two categories:

  1. Downloading stuff into a file. If the application doing the download was just told "Here's bytes 201-300", it could save them at position 201 onwards in the file, then save bytes 101-200 at position 101 onwards, as soon as they appeared (see the sketch below). Buffering them in the TCP stack doesn't really help anybody, except perhaps taking a little of the burden of avoiding disk seeks off the disk I/O scheduler...

  2. Sending discrete requests/responses/messages. Often, an application has a message to send. Since TCP just transmits a stream of bytes, the application needs to mark the start and end of each message. TCP will do the job of splitting the message into chunks small enough to fit in a packet, sending them, and putting them back together in order, which is great within the message; but if the application sends two messages, and part of the first message is delayed (either taking a long route through the network, or it didn't arrive and has to be re-sent), then the entire second message sits there inside the TCP stack, not provided to the application - so-called head-of-line blocking.

Sometimes the application is glad of this - if it's important the messages are processed in order, sure, it needs to wait. But often it isn't important, and it would in fact be more useful if messages could be processed as soon as they arrive in whatever order they arrive. Because TCP doesn't know about how the application is breaking the TCP byte stream into messages, it can't know that there's a complete message it could pass on - all it knows about are bytes.
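
Coming back to case 1: if the transport handed chunks over as they arrived, the downloader needs only a positioned write per chunk. A sketch - the filename is made up, and os.pwrite is Unix-only:

    # Write each chunk at its offset in the file, in whatever order it arrives.
    import os

    fd = os.open("download.tmp", os.O_CREAT | os.O_WRONLY, 0o644)

    def on_chunk(offset: int, data: bytes):
        os.pwrite(fd, data, offset)   # no buffering, no waiting for earlier bytes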

TCP makes applications build their own messaging protocols on top

As mentioned in the previous point, most TCP applications introduce their own structure for messages on top of TCP, because a raw stream of bytes is too low-level for most applications' communications needs (almost no TCP connections are nothing more than a raw stream of bytes; telnet is almost that, but has message-based control signals embedded - perhaps an FTP data stream qualifies?). The thing is, this pushes a bunch of complexity into the app that must be re-invented every time, as standardisation in this area has been poor.

A lot of TCP apps go for a simple request/response model. The end that makes the connection is assumed to want some service from the end that sat there waiting for a connection, so once a connection is made, the service provider sits there waiting for requests. The client will send a request in some form, with some way of marking the end of the request (either sending the length first so the other end knows how many bytes to read, or having a special marker for the end - with attendant complexities if it needs to send that marker as part of the request for some reason). It will then sit waiting to read data back, as the data returned is the response to the last request sent (again, with some detail as to how to know when the response has actually ended).

This is simple to implement - the code in the client looks like "Send request; Read response; process response", and the code in the server looks like "while(connection not closed): read request; process request; send response".
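
Here's what the "send the length first" flavour of framing might look like - a sketch using a 4-byte big-endian length prefix, which is one common convention but by no means the only one:

    # Length-prefixed message framing over a TCP socket.
    import struct

    def send_message(sock, payload: bytes):
        sock.sendall(struct.pack(">I", len(payload)) + payload)

    def recv_exactly(sock, n: int) -> bytes:
        buf = b""
        while len(buf) < n:                  # recv() may return fewer bytes than asked
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed mid-message")
            buf += chunk
        return buf

    def recv_message(sock) -> bytes:
        (length,) = struct.unpack(">I", recv_exactly(sock, 4))
        return recv_exactly(sock, length)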

But it becomes a problem when you want to scale up the performance of your application. If your application starts a long-running request, then other requests can't happen in the meantime - you can send them, if you wish, but even if the server is watching for additional requests arriving while it's processing a request (which is additional complexity in the server to implement) it can only send the responses back in the order the requests were sent. So some protocols add a "request ID" to requests, and responses can arrive in any order because they contain the ID of the request they're responding to. Now the client and server applications can send requests at will and send responses back as soon as they're ready, at the cost of needing to implement their own multithreaded routing logic, with locking around access to the TCP connection to ensure that parts of requests or responses don't get inserted into each other due to trying to send two at the same time.

However, it's still not perfect - if the client requests a large file, several gigabytes in size, and that's sent as a single response (after all, it is the response to a single request, and the client might not have known how big the file would be in advance), then the TCP connection from server to client is taken up sending that large response for some time, and other, quick, responses can't be sent until that one has finished.

To fix that, the protocol needs to be made more complicated - requests and responses can now be split into smaller "chunks", and chunks from multiple requests or responses interleaved along the TCP connection, to be reassembled at the other end.
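
For instance, each chunk might carry the ID of the request or response it belongs to, plus a "final chunk" flag. This particular frame layout is my own invention, purely for illustration:

    # Hypothetical chunk frame: (request ID, last-chunk flag, length), then payload.
    import struct

    def send_chunk(sock, request_id: int, last: bool, payload: bytes):
        sock.sendall(struct.pack(">IBI", request_id, int(last), len(payload)) + payload)

    # Receiver: reassemble chunks per request ID, so one huge response no
    # longer blocks quick ones - their chunks simply interleave on the wire.
    partial = {}   # request_id -> chunks received so far

    def on_chunk(request_id: int, last: bool, payload: bytes, deliver):
        partial.setdefault(request_id, []).append(payload)
        if last:
            deliver(request_id, b"".join(partial.pop(request_id)))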

And also, sometimes the "client initiates things, server sits waiting" model isn't enough. A client application might request some information, and want the server to update it immediately if that information changes. If only the client can initiate things, then it must keep asking the server whether there have been any updates. Ideally, both ends should be able to send a message, or a request expecting a response, to the other end. Under the original simple request/response model that can't be done - but if we've already gone to the effort of implementing concurrent requests and responses with chunking, making the connection symmetrical is a comparatively small extra step...

There have been a few attempts to standardise a useful protocol on top of TCP. In practice, people use HTTP and WebSockets: HTTP as a framing protocol to denote the starts and ends of messages, with multiplexing of concurrent requests and chunking, and WebSockets as a way to set up "reverse" connections so that servers can send messages to clients at will. It's a pretty complicated stack to implement, and fairly wasteful in terms of data sent on the wire, because of the various workarounds and compatibility hacks required to adapt what was initially designed as a "HyperText Transfer Protocol" to this task.

BEEP is an attempt to start from scratch; the goals look pretty good (although it was worked on back when XML was cool, so it has a nasty XML flavour to it), but it hasn't taken off; I've never encountered an implementation, and the project web site hasn't been updated since 2016 at the time of writing.

TCP is inflexible

Relatedly, applications that send messages down a connection often have various kinds of messages with different requirements. Imagine an online game - if the player presses a sequence of buttons in the game, then those button presses must arrive at the server, and in the correct order. But at the same time, the server might be streaming back messages with the latest position of all the other players; if one of those goes missing, it's wasteful to stall the entire stream waiting for a retransmission when the stream is filling up with messages containing more recent player positions, which make the missing message entirely obsolete. But, again, as TCP doesn't know about the application's message boundaries, it can't handle different messages differently.
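
What the game wants is a per-message choice of delivery guarantees - something like this hypothetical API, which TCP has no way to offer:

    # Hypothetical per-message delivery classes for a message-aware transport.
    from enum import Enum, auto

    class Delivery(Enum):
        RELIABLE_ORDERED = auto()   # button presses: must arrive, in sequence
        UNRELIABLE = auto()         # position updates: newer ones obsolete older

    reliable_queue = []   # retransmitted until acknowledged, delivered in order
    best_effort = []      # sent once; a loss is simply tolerated

    def send(payload: bytes, mode: Delivery):
        queue = reliable_queue if mode is Delivery.RELIABLE_ORDERED else best_effort
        queue.append(payload)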

The three-way handshake (and connection shutdown) is slow and wasteful for short connections

If part of the three-way handshake used to open a TCP connection gets lost in the network, the client finds out because it doesn't get the SYN/ACK back in a reasonable timeframe and tries again. It will keep trying for a while before giving up. Either way, if the network is down or unreliable, there can be several seconds' delay before the network stack reports back to the client program that it has a working connection, or that it can't get one.

And the three-way handshake (and the FIN packets used to close a connection) can be a waste of time. If the entire thing to be sent down the TCP connection could have just fit in a single packet, we end up doing this:

  1. CLIENT -> SERVER: SYN
  2. SERVER -> CLIENT: SYN/ACK
  3. CLIENT -> SERVER: ACK + DATA
  4. SERVER -> CLIENT: ACK
  5. CLIENT -> SERVER: FIN
  6. SERVER -> CLIENT: FIN/ACK
  7. CLIENT -> SERVER: ACK

Seven packets sent (the client has to acknowledge the server's FIN, too), to deliver a single packet of data! We could have just done this:

  1. CLIENT -> SERVER: DATA
  2. SERVER -> CLIENT: ACK (so the client knows it arrived and doesn't have to resend it)

On top of that, those short connections will get NO flow control; TCP's per-connection flow control won't even get off the ground.

Because of this small-connection penalty, there has been much work in HTTP to make it re-use a single TCP connection for multiple logical HTTP requests, but this has two downsides:

  1. It pushes more complexity into protocols on top of TCP, such as HTTP, to work around TCP's deficiencies; and if your application only makes a single HTTP request itself it will use its own TCP connection, even though thousands of instances of your application might be running in parallel on the same computer.
  2. A lot of TCP connections are closed very quickly because there's something wrong with the first bit of data sent - an incorrect password, some error with the request message that was sent, or because the receiving server has a problem or is responding with a redirect to another server because it's busy. This usually precludes any other HTTP requests that might have been going that server's way in the near future; they're either never tried, or redirected to another server.

TCP checksums are a bit weak

TCP puts a checksum in packets, so if the packet is damaged in transit across the network, when it arrives, the TCP stack on the recipient can reject it and it will be re-transmitted.

However, the checksum - a 16-bit ones'-complement sum - is pretty weak, and in practice, quite a lot of errors are not detected by TCP. Applications on top of TCP often assume that any errors will have been caught and dealt with, and will blindly trust data to have arrived unmolested. It's hard to say how much data has been damaged in this way, but thankfully cryptographic layers on top of TCP such as SSH and TLS are pretty good at identifying bad data, so the situation is improving - but no thanks to TCP!
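
The checksum in question is the Internet checksum of RFC 1071; a sketch of it makes the weakness concrete - for one thing, swapping two aligned 16-bit words changes nothing:

    # RFC 1071 Internet checksum: 16-bit ones'-complement sum, inverted.
    def internet_checksum(data: bytes) -> int:
        if len(data) % 2:
            data += b"\x00"                            # pad odd-length input
        total = 0
        for i in range(0, len(data), 2):
            total += (data[i] << 8) | data[i + 1]
            total = (total & 0xFFFF) + (total >> 16)   # fold in end-around carry
        return ~total & 0xFFFF

    # Word-swapped data passes undetected:
    assert internet_checksum(b"ABCD") == internet_checksum(b"CDAB")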

TCP can't handle multicast

It's not practical to provide ordered, reliable delivery in a wide-area multicast. When lots of people are streaming the same live video from YouTube, they each get their own TCP connection to the nearest node, and those connections all carry the same data.

However, it is practical to support scalable wide-area multicast: IP multicast does it, although support for it across the public Internet is sadly poor. But because TCP is "ordered and reliable or bust", to get IP multicast, you need to throw the baby out with the bathwater and use an entirely different protocol.

As an aside, flow control in such a multicast streaming environment is interesting. You can't slow down the sending of packets due to slow recipients or intermediate links, so instead you use a "drop priority": encode your video stream in quality layers, with a base layer containing a minimal-quality stream, then additional layers that contain extra data to increase the quality on top of the layers beneath. You can then mark the different packets that make up the streams with a "drop priority". When a link is overloaded, or the final destination can't cope because it's busy, packets with a higher drop priority are discarded first - so you put the lowest drop priority on the base layer, then increasing drop priority on the increasing quality layers. So if you don't have a good enough connection, or a fast enough processor, to handle the 4k HD stream, then the network discards the bits of it that make it 4k while still keeping the bits that enable it to be a 1080p stream...
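
A rough sketch of what a drop-priority queue at an overloaded link might do; the queue limit here is invented for illustration:

    # When the queue is full, evict the most expendable packet first:
    # enhancement layers (high drop priority) go before the base layer (0).
    QUEUE_LIMIT = 64
    queue = []   # (drop_priority, packet) pairs awaiting transmission

    def enqueue(drop_priority: int, packet: bytes):
        queue.append((drop_priority, packet))
        if len(queue) > QUEUE_LIMIT:
            victim = max(range(len(queue)), key=lambda i: queue[i][0])
            queue.pop(victim)   # the 4k layer drops; the 1080p layers survive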

Can we do better?

Yes, we can! Various network protocols improve on TCP in different ways. Those who are familiar with my style of designing software will no doubt not be surprised that I have a plan to merge the tricks they use into one protocol that can meet a wide variety of needs, by being flexible... Let's take a tour of ideas I've gleaned from various protocols!

Reliable Datagram Protocol / Reliable User Datagram Protocol

RDP and RUDP are both message-based protocols that handle retransmission of lost packets, ordered delivery and flow control. But being message-based, they make it possible to request different types of service for different messages, even within the same logical connection.

CHAOS

CHAOS is a long-defunct protocol, but it has some interesting features that are worth examining. I've discussed it in an earlier blog post, but it includes the ability to do simple requests without any three-way handshakes, the ability to include request details when setting up a connection so that it can be rejected outright rather than needing to wait for the handshake to complete, and the choice between connectionless and connection-oriented communications under one umbrella.

NetBLT

While TCP tries to create the illusion of a single stream of bytes, NetBLT is all about moving a block of data. This is useful, because most of the time, TCP applications are trying to move a block of data rather than a stream of bytes - be it a chunky file upload/download, or a small request.

It does not attempt to provide in-order delivery; as chunks of the data arrive at the recipient, they are given straight to the receiving application. If the application wants to assemble them in a buffer in memory until it's all there, it is free to do so, but if it wants to do something with each piece as it arrives - such as writing chunks to a file, or displaying parts of an image, or processing parts of a random-access data file - in whatever order they arrive, thereby freeing up memory and offering faster responses, it can.

But NetBLT also has a smarter flow control system than TCP. Rather than fiddling around with sliding window sizes (an awful detail of TCP's flow control I have saved you, dear reader, from), it picks a rate to send packets at, sends packets at that rate, observes how many arrive, and tweaks the rate up or down. The original NetBLT protocol was abandoned and the flow control algorithm was never really finished, but in principle it should be possible to construct a control feedback loop that tweaks the rate to sit just at the point where the packet loss rate starts to rise due to congestion, without being fooled by reasonable levels of background packet loss as found on wireless networks, and that reacts quickly to changes - because the indirection and delay involved in managing the TCP sliding window is absent; the controlled variable is, directly, the outbound bandwidth consumption.
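
A feedback loop in that spirit might look like the following sketch; the thresholds are invented, and finding good ones is exactly the part that was never finished:

    # Rate-based control: steer packets/sec directly from the observed loss
    # rate, tolerating a background loss floor such as a noisy wireless link.
    BACKGROUND_LOSS = 0.005   # assumed noise floor

    def adjust_rate(rate: float, sent: int, acked: int) -> float:
        loss = 1.0 - acked / sent if sent else 0.0
        if loss > 2 * BACKGROUND_LOSS:
            return rate * 0.85   # clearly above the floor: congestion, back off
        return rate * 1.05       # otherwise, probe gently upwards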

Per-computer flow control

This is an idea I've had myself; I've never read about it already being used, which surprises me as it's a pretty simple idea, and quite relevant to modern usage patterns.

Rather than handling flow control on a per-connection basis, track flow control state for each different computer you're connecting to (sketched after the list below). Multiple connections share the same flow control state if they go to the same computer at the other end; we scale back the rate at which we're sending to a particular computer, even though we're sending data down multiple connections to it at once.

This means:

  1. We don't have multiple connections fighting each other for bandwidth, or bumping into each other and both backing off, like we do with TCP.
  2. We can keep that flow control state between short-lived connections, meaning they start knowing how fast they can send based on past experience and not needing to "slow start" or anything every time.
  3. We could, in principle, notice that packet loss to certain groups of other computers tends to rise when the total traffic we're sending to that group of computers goes past a threshold, and deduce that there's a common link to them all (perhaps even guided by noticing that their IP addresses share a substantial prefix as they're in the same network), and apply flow control to them all as a group. A group of "All computers not on my LAN" might naturally arise, identifying the bottleneck of your DSL or mobile data connection, and your computer could then deductively perform smart flow control to manage the limited bandwidth of its connection to the Internet! Finding a good algorithm to group destination computers together in this way is not something I've given deep thought to yet, but it sounds promising...
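
A sketch of the core bookkeeping; the rates and constants are invented for illustration:

    # Congestion state keyed by remote host: shared by all connections to that
    # host, and kept between short-lived connections so they start warm.
    rates = {}            # remote address -> last known good sending rate
    DEFAULT_RATE = 10.0   # conservative opening rate for hosts we've never met

    def rate_for(host: str) -> float:
        return rates.get(host, DEFAULT_RATE)

    def on_feedback(host: str, saw_loss: bool):
        r = rate_for(host)
        rates[host] = r * 0.5 if saw_loss else r + 1.0   # AIMD, but per host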

Combining different communication types

Traditionally, transport protocols such as TCP implement a single model of communication. For TCP, it's bidirectional byte streams. For connectionless reliable datagram protocols like RDP, it's the exchange of messages. For connection-based reliable datagram protocols like RUDP, it's an ordered stream of messages. Applications requiring a combination of communication types need to either pick one protocol based on their "biggest" need and hack the others in somehow (like the common pattern of implementing your own message-exchange protocol on top of TCP), or use several protocols at once and complicate their deployment with a wide range of protocol/port combos (like VoIP phones that need a dizzying array of ports open in the firewall for SIP and RTP and IAX and so on).

There's a lot of duplicated work between those different protocols - they all need to implement their own flow control, retransmission, and large-message fragmentation/reassembly logic, for a start; and connection setup/keepalive/shutdown logic for those with a connection model - just to differ somewhat in how the unit of communication presented to the application is framed.

CHAOS supported connectionless and connection-oriented communications under the same umbrella, and there's no reason why we can't do that today. A transport protocol could itself be layered, like this:

  1. Flow control and retransmission (the two are entwined, as retransmitting lost packets is a key indicator of network congestion).
  2. Fragmentation and reassembly of large messages, and fairly interleaving multiple in-progress large message transmissions.
  3. Providing request/response semantics, through giving messages an ID that can be used to match responses to requests.
  4. Connection management, associating messages/requests/responses with particular connections, and optionally ensuring that messages/requests sent down a connection arrive in the order they were sent (this could be requested on a per-message/request basis, to support use cases like my hypothetical online game above).

On the topic of connection management, an idea I've had that may or may not be novel (but I've not seen it elsewhere, beyond simple single-level multiplexing schemes with limited scope) is thus: allow the creation of connections within connections (nested arbitrarily). After all, if the operations we can perform on a remote service are "Send a message; Send a request and wait for a response; open a connection" but the operations we can perform on a connection are "Send a message; Send a request and wait for a response"... why not include "open a connection" as well to make it consistent?

The ability to multiplex connections within each other would only be useful in relatively niche cases, but lightweight sub-connections could simplify application development. If a top-level connection is established to authenticate a client to a server, and set up cryptographic state used to protect the connection in the manner of TLS and SSH, and the application then uses sub-connections within that to manage things like multiple concurrent SQL cursors, there is no need to recreate all that heavyweight top-level state for each of them, nor to implement your own multiplexing. And allowing arbitrary nesting of sub-connections would mean that existing applications could be bundled together into a parent connection without even needing to know they were creating sub-connections rather than top-level connections. That could make it possible to create composable libraries of communications patterns!
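
The consistency argument is easiest to see as an interface - this one is entirely hypothetical:

    # Hypothetical uniform interface: a sub-connection supports everything its
    # parent does, including opening further sub-connections.
    class Endpoint:
        def send_message(self, payload: bytes) -> None: ...
        def request(self, payload: bytes) -> bytes: ...
        def open_connection(self) -> "Endpoint": ...

    # outer = remote.open_connection()    # authenticate and encrypt once
    # cursor_a = outer.open_connection()  # one SQL cursor
    # cursor_b = outer.open_connection()  # another, multiplexed for free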

Combining them all

So, I have a plan to combine them all. As I mentioned in my blog post on CHAOS, I have a rough design for a protocol I call IRIDIUM... but yesterday evening I went to bed early, curled up with pen and paper, and sketched out some packet formats for an implementation of IRIDIUM.

To meet my immediate needs for a network protocol, I want to define two transports for IRIDIUM:

  1. UDPv4 packets. This will let me use IRIDIUM to communicate across the Internet, just like TCP.
  2. Byte-stream links. One application I'm working on uses Unix domain sockets, and some embedded hardware projects I have in mind will use serial or radio links. Defining an IRIDIUM transport over byte streams means I can use it for both of those (and the byte stream protocol could support optional re-synchronisation after lost bytes, and forward error correction, which are useful on low-level links). Also, for fixed-bandwidth byte-stream links I'll be able to support bandwidth reservation; putting support for that in the application-side API means I'll be able to think about supporting it in the UDPv4 implementation via RSVP, for networks that support that (if there are any?).

I'm not going to support multicast initially; although IP multicast adoption might increase in future it's pretty spotty for now - but link-local multicast is widely available and potentially useful. So I'll get to that later!

(edit) I've typed up my IRIDIUM draft specification! Go and have a look!

5 Comments

  • By Vg, Tue 5th May 2020 @ 9:03 pm

    A few years ago, I had to implement a TCP stack. I came to the conclusion that the one sole design goal for TCP was this: connect a dumb character-based terminal to a Unix machine over the network. TELNET. It's useless for anything that is not an unstructured character stream.

  • By John Cowan, Tue 5th May 2020 @ 11:34 pm

    The article "Sequenced Packets Over Ordinary TCP" http://urchin.earth.li/~twic/Sequenced_Packets_Over_Ordinary_TCP.html proposes using the obscure TCP urgent pointer to mark the end of the packet, thus providing SOCK_SEQPACKET semantics. This allows you to have messages bridging multiple TCP segments / IP packets, though you can't have multiple messages in a segment.

    RFC 962 is a 1985 proposal to enhance TCP by adding NYS (no-way handshake) and NIF (graceless close) packet types. It's only two pages and pretty funny, but perfectly plausible.

  • By anonymous, Wed 6th May 2020 @ 4:11 am

    Hello. TCP may not be perfect, but it is not that bad either. Most of the complaints you have about it are because you don't understand what problem it is trying to solve.

    One example is the 3 way handshake. This handshake is needed to stop spoofing attacks or DDoS amplification attacks.

    I would ask you to study TCP in detail before criticizing it or even making your own replacement.

  • By Andrew Ducker, Wed 6th May 2020 @ 9:30 am

    I'm curious why you haven't covered QUIC in this round-up. Seems like the obvious successor.

  • By alaric, Wed 6th May 2020 @ 5:06 pm

    Good point, Andrew! Here's a summary of QUIC:

    Much like what I'm proposing with IRIDIUM, it's a transport-level protocol built on top of UDP so it can be developed in userland.

    It uses the concept of including some data in the connection setup / response packets, in order to start cryptographic key exchange protocols at the same time as performing connection setup, which helps latency. I don't know if it lets the application include request data in that first packet or if it's limited to cryptographic setup?

    Performing the encryption at the packet level rather than on top of a TCP stream avoids the problems I spoke of about application-level framing interacting badly with TCP's in-order delivery. However, would I be right in thinking that QUIC still provides a byte-stream model to the application, so application-level framing on top will still need to be done?

    So I think QUIC is a good example of the demand for doing better than TCP, and that you can do better with something built on top of UDP - but it's focussed on being a drop-in TCP replacement, mainly to support HTTP on top, and I think that it's also possible to provide better semantics to the application than a byte stream!


Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales