Why are networks so hard to build?

One of my sidelines is network management.

Often, the problem is this: you have a bunch of sites, each with zero or more external connections out to the wider Internet (or to people who you provide an Internet connection to), and each with zero or more computers that need some level of network connection (be they servers or workstations). Each computer needs to be able to talk to some subset of the other computers, and may need to talk to computers out on the Internet or some other external network; computers on those external networks may, in turn, need to be able to talk to it. Computers may be on public IP addresses or on private ones; in the latter case, if the computer can talk to external networks, there needs to be a public IP address (possibly shared with others) that its connections are NATed from, and if incoming connections are allowed, there must be a public IP to which those connections are sent to be "forwarded" to the private IP. We can think of those NAT/forwarding public IPs as "virtual IPs", which don't correspond to a physical computer, but seem to by way of some form of port/address translation.

Also, each computer or external network connection needs some level of reliability. Some have low requirements, and we can happily tolerate perhaps up to a day of outage per year; that's a mere 99.7% uptime. The fabled "five nines uptime", 99.999%, equates to a maximum of about five minutes of downtime a year. And that downtime isn't just used up by equipment failures; if your network's requirements grow and you need to upgrade things to provide more capacity, you might need some downtime to replace and reconfigure things.
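The arithmetic behind uptime percentages is easy to check. Here's a minimal Python sketch; the function names are my own, and it assumes a 365-day year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(uptime_percent):
    """Return the yearly downtime budget, in minutes, for an uptime target."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


def uptime_percent_for(downtime_minutes):
    """Return the uptime percentage implied by a yearly amount of downtime."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


for target in (99.7, 99.9, 99.99, 99.999):
    budget = downtime_minutes_per_year(target)
    print(f"{target}% uptime allows {budget:.1f} minutes of downtime a year")

# One full day of outage a year, expressed as an uptime percentage.
print(f"{uptime_percent_for(24 * 60):.3f}%")
```

A whole day of outage comes out at roughly 99.7%, while the five-nines budget is a little over five minutes a year.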

In other words, the problem domain is already complex. But the fun's just starting.

To implement the requirements, first of all, we need a sub-netting plan. We need to start by grouping our own computers together into sets that have the same access requirements (more or less) and will be in the same approximate geographic location. Given our available range of public IPs, and the set of IPs set aside for private networks, we need to choose IP ranges of the appropriate types for those sets of computers, allowing room for projected future growth. We need to keep lots of IPs spare for network management and inter-router links - but more on those later.
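To make the sub-netting step concrete, here's a rough sketch using Python's standard ipaddress module. The address block, the group names and the host counts are invented for illustration, and carving subnets out largest-first is just one simple strategy, not the only way to lay out a plan:

```python
import ipaddress


def plan_subnets(block, groups):
    """Carve one subnet per group out of `block`, largest group first.

    `groups` maps a group name to the number of hosts it must hold,
    including whatever headroom you want for future growth.
    """
    pool = ipaddress.ip_network(block)
    cursor = int(pool.network_address)
    plan = {}
    for name, hosts in sorted(groups.items(), key=lambda item: -item[1]):
        needed = hosts + 2  # network and broadcast addresses are unusable
        host_bits = (needed - 1).bit_length()
        size = 1 << host_bits
        cursor = (cursor + size - 1) // size * size  # align to subnet size
        subnet = ipaddress.ip_network((cursor, 32 - host_bits))
        if not subnet.subnet_of(pool):
            raise ValueError(f"{block} is too small to fit {name}")
        plan[name] = subnet
        cursor += size
    return plan


# Hypothetical requirements for a single site.
plan = plan_subnets(
    "10.20.0.0/22",
    {"workstations": 400, "servers": 100, "router-links": 2},
)
for name, subnet in plan.items():
    print(name, subnet)
```

Whatever is left over at the end of the block is the spare space to hold back for network management and inter-router links.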

Then we need to think about physical switches. Switches are the basic building block of the network. Each computer needs to plug into a switch, or maybe more than one if the uptime requirements of that computer's network connection are too high to allow for the expected frequency of switch failures. Each switch can carry multiple subnets, using VLANs, and if a subnet exists on multiple switches, we need a 'trunk' cable joining the two switches (which can carry multiple VLANs using tagging). We don't need a cable between every pair of switches carrying the same subnet, though; a switch will happily forward traffic coming from one trunk out to another trunk, but we need to be mindful of our uptime requirements; if an intermediate switch or trunk fails, then the subnet will be shattered into pieces unless we set up alternative paths through the network (and suffer the consequences of Spanning Tree Protocol aka STP, which doesn't yet support VLANs very well; you pretty much have to trunk every VLAN to every switch if you're using STP and VLANs together).
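The "shattered subnet" risk can be checked mechanically. The sketch below assumes every trunk carries the subnet in question (as you'd have with every VLAN trunked everywhere), uses made-up switch names, and simply removes each switch in turn to see whether the survivors can still reach each other:

```python
from collections import deque


def still_connected(switches, trunks):
    """Return True if every switch can reach every other over the trunks."""
    switches = set(switches)
    if not switches:
        return True
    start = next(iter(switches))
    seen = {start}
    queue = deque([start])
    while queue:
        here = queue.popleft()
        for a, b in trunks:
            if here == a:
                other = b
            elif here == b:
                other = a
            else:
                continue
            if other in switches and other not in seen:
                seen.add(other)
                queue.append(other)
    return seen == switches


def single_points_of_failure(switches, trunks):
    """Return the switches whose failure would split the subnet."""
    weak = []
    for failed in switches:
        survivors = [s for s in switches if s != failed]
        live_trunks = [(a, b) for a, b in trunks if failed not in (a, b)]
        if not still_connected(survivors, live_trunks):
            weak.append(failed)
    return weak


switches = ["sw1", "sw2", "sw3"]
chain = [("sw1", "sw2"), ("sw2", "sw3")]
ring = chain + [("sw3", "sw1")]
print(single_points_of_failure(switches, chain))  # the middle switch
print(single_points_of_failure(switches, ring))   # nothing: the ring survives
```

Closing the chain into a ring removes the single point of failure, which is exactly the situation in which you then need something like STP to stop frames looping.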

But just getting the computers onto their subnets is merely the START of the fun. We've got all these subnets, and computers in the same subnet can talk to each other, but there's still no communications between the subnets. For that, we need routers and firewalls.

Routers and firewalls overlap heavily; most firewalls are also routers. But a device calling itself a "router" rather than a "firewall" is usually a much more sophisticated router (often with very basic firewalling abilities). The function of a router is to route traffic between subnets; the function of a firewall is to do so as well, but "filtering" traffic against a list of rules defined within it, and potentially implementing the port/address translation required to connect subnets that can't actually be joined with routing, by providing virtual IPs on one subnet that map to real IPs on the other subnet.

Connections to external networks are also handled by routers and firewalls, as they're really just other subnets.

Since routers tend to have low numbers of actual ports to plug cables into, it's normal to have a bunch of routers, each talking to a number of subnets or connections to external networks, all connected to a special "backbone" subnet set up purely for the routers to communicate with each other. So traffic between two subnets will often go through two routers; one to get onto the backbone, and one to get back off of it and into the destination subnet.

So, except for the rare case of a subnet of computers that require no access to other subnets, every subnet has one or more routers or firewalls on it. Usually, each "edge" subnet full of your own computers is joined by a firewall to a "backbone" subnet, and then routers connect "backbone" subnets (for there may be more than one) to each other and to external networks.

Often, talking to an external network requires supporting Border Gateway Protocol aka BGP to dynamically exchange connectivity information with the external networks, so that the Internet as a whole can reconfigure itself when cables and routers fail; this is the kind of feature that distinguishes a "proper router" from a firewall. Within the organisation, the BGP routers need to talk to each other to exchange their BGP routing tables, so they can decide which of them has the best link to any particular external network; and usually you use a protocol such as Open Shortest Path First aka OSPF (which is likely to be supported by your firewalls) between your internal routers, firewalls, and the BGP routers, so they can exchange information about their connectivity and your internal network can likewise detect and adjust for internal cable, router, and switch failures. OSPF is aimed at small networks compared to BGP's focus on the global Internet; so you just advertise a "default route" into OSPF from each BGP router, meaning that any traffic destined for the Internet goes to the nearest BGP router, which can then use its larger BGP routing table to decide which BGP router has the best link to use (or, if it's that very router, which link is best); it can then forward it over the backbone subnets to that router.
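The way a default route coexists with more specific internal routes comes down to longest-prefix matching: the most specific route that covers the destination wins, and 0.0.0.0/0 only catches what nothing else does. A small sketch, with a made-up routing table for an internal router (the next-hop names are hypothetical):

```python
import ipaddress


def next_hop(routes, destination):
    """Pick the next hop whose prefix is the longest one covering the destination."""
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, hop in routes.items():
        network = ipaddress.ip_network(prefix)
        if dest in network and (best is None or network.prefixlen > best[0]):
            best = (network.prefixlen, hop)
    return None if best is None else best[1]


# Internal subnets learnt via OSPF, plus the default route
# advertised by the nearest BGP router.
routes = {
    "10.1.0.0/24": "firewall-a",
    "10.2.0.0/24": "firewall-b",
    "0.0.0.0/0": "bgp-router-1",
}

print(next_hop(routes, "10.2.0.9"))
print(next_hop(routes, "198.51.100.7"))
```

Internal traffic matches one of the /24 routes; anything else falls through to the default route and heads for the BGP router, which holds the big table.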

Most firewalls just appear as routers that happen to be choosy about what traffic they forward, but if you have any virtual IPs around, then the firewall responsible for them pretends to be a computer on the subnet the virtual IPs are on; but traffic going to those addresses just gets forwarded, and traffic coming out from the protected computer is adjusted to appear to be coming from a virtual IP.
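As a toy model of a firewall handling a virtual IP, the sketch below first translates the virtual address to the real one and then applies a filter rule list. The addresses come from the documentation ranges, and the rule and forwarding table formats are invented for illustration rather than taken from any real firewall:

```python
import ipaddress

# Allowed flows: (source network, destination network, destination port).
RULES = [
    ("0.0.0.0/0", "192.168.10.0/24", 443),
]

# Virtual (public IP, port) -> real (private IP, port).
FORWARDS = {
    ("203.0.113.10", 443): ("192.168.10.5", 443),
}


def handle_inbound(src, dst, port):
    """Translate a virtual IP if one matches, then filter.

    Returns the real (address, port) to forward to, or None if dropped.
    """
    dst, port = FORWARDS.get((dst, port), (dst, port))
    for rule_src, rule_dst, rule_port in RULES:
        if (
            ipaddress.ip_address(src) in ipaddress.ip_network(rule_src)
            and ipaddress.ip_address(dst) in ipaddress.ip_network(rule_dst)
            and port == rule_port
        ):
            return dst, port
    return None


print(handle_inbound("198.51.100.7", "203.0.113.10", 443))  # forwarded inside
print(handle_inbound("198.51.100.7", "203.0.113.10", 22))   # dropped
```

From the outside, 203.0.113.10 looks like a web server; in this model it's really just an entry in a table on the firewall.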

Now, you can organise your uptime requirements by putting in multiple routers between given pairs of subnets, easily enough; then any one router (or the cables connecting it, or the switches its subnets go through) can fail and there'll be another path. For really high uptime requirements, you might arrange triple redundancy instead, but dual suffices for most cases. Routers and switches are quite reliable (it's the cables that go, because they get cut or unplugged by accident...). Firewalls are a little trickier, because while traffic can go via different paths through a network quite happily, firewalls often need to keep track of all traffic to their protected subnets in order to implement the filtering; if part of some traffic goes through one firewall and part through another, neither firewall can necessarily see enough of it to decide if it's legal or not, nor maintain the "connection state" labelling that particular stream of traffic as allowed or not allowed (and other information needed to make virtual IPs work correctly). So to provide higher uptime than a single firewall can provide, you usually need to have two firewalls (that are designed to do this, so generally the same model from the same manufacturer) running side by side with special connections between them so they can share that information between them continuously.
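Here's a deliberately simplified model of why split paths upset stateful firewalls, and what state sharing buys you. The class and method names are my own invention, and real firewalls track far more than a (client, server, port) triple:

```python
class StatefulFirewall:
    """Toy firewall that only lets replies in for connections it has seen open."""

    def __init__(self, name):
        self.name = name
        self.connections = set()  # (client, server, port) triples

    def outbound(self, client, server, port):
        """An inside client opens a connection; remember it."""
        self.connections.add((client, server, port))
        return True

    def inbound_reply(self, server, client, port):
        """Allow a reply only if it matches a known connection."""
        return (client, server, port) in self.connections

    def sync_from(self, peer):
        """Copy the peer's connection state, as a failover pair would."""
        self.connections |= peer.connections


fw_a = StatefulFirewall("fw-a")
fw_b = StatefulFirewall("fw-b")

# The request leaves via fw-a, but the reply comes back via fw-b.
fw_a.outbound("10.1.0.20", "198.51.100.7", 443)
print(fw_b.inbound_reply("198.51.100.7", "10.1.0.20", 443))  # False: unknown to fw-b

fw_b.sync_from(fw_a)
print(fw_b.inbound_reply("198.51.100.7", "10.1.0.20", 443))  # True once state is shared
```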

Another complication is that the actual computers at the edge of the network typically only support talking to a single router. They'll know to talk directly to other computers on the same subnet, but for talking to any other computer, they often simply send all the traffic to a single designated router on the subnet. This means that, while the backbone subnets can have lots of routers on them quite happily, all talking OSPF and BGP to share their routes between each other and so route traffic to the right router to forward it on, edge subnets prefer to have just a single router.
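The decision an edge computer makes is about as simple as routing gets. A sketch, with example addresses of my own choosing:

```python
import ipaddress


def first_hop(interface, gateway, destination):
    """Return where a host sends a packet: the destination itself if it is
    on the local subnet, otherwise the single configured gateway."""
    local = ipaddress.ip_interface(interface).network
    if ipaddress.ip_address(destination) in local:
        return destination
    return gateway


# A host at 10.1.0.20/24 whose one and only router is 10.1.0.1.
print(first_hop("10.1.0.20/24", "10.1.0.1", "10.1.0.99"))  # same subnet: direct
print(first_hop("10.1.0.20/24", "10.1.0.1", "10.2.0.5"))   # elsewhere: via gateway
```

There's only one gateway argument, which is the whole problem: if that router dies, the host has nowhere else to send its traffic.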

So what of your uptime?

As with the firewall case, support has emerged for routers (again, typically having to be from the same manufacturer to be compatible) to work as a pair with a single IP shared between them, to make this possible. But this complicates things a little.

Oh, and I mentioned that all this might be in different physical sites, didn't I? Depending on the kind of cables you can run or rent between the sites, you may have limitations. If you can run your own cables, or have a very helpful person you can rent them from, you might be able to get a trunk cable that can carry lots of VLANs, and thus join arbitrary subnets between sites. But sometimes you are limited to a single subnet over the inter-site cable, and sometimes not even that; sometimes you are restricted to a leased line over which you can only carry a tiny subnet with two IPs, to join two routers, and you cannot have any proper inter-site subnets, and need to have routers joining the inter-site link to backbone subnets in each site.

This sucks

Given complex requirements, and then the limited capabilities of the cables, switches, routers, computers and firewalls available to you, actually designing these networks can be a nightmare. Especially when you need to justify your budget for switches and routers, while still providing the required levels of uptime. And even when you've done it, you now have this complex tangle of VLANs and switches and routers and firewalls, which you need to document and maintain and diagnose problems on; and then a year later, new requirements will emerge; and you need to do the whole thing again - with the exciting new constraint that you need to make the best use of your existing set-up (to cut costs compared to throwing it all away and starting from scratch), and to introduce the new set-up while still keeping to your existing uptime requirements.

Needless to say, this sets me wondering how it could be better.

The real requirements

Really, every computer or external connection (of any kind) needs to be plugged into some kind of network infrastructure gadget in order to be part of the network. That much is evident. Some computers might need to plug into more than one, for redundancy.

And those infrastructure gadgets need to be plugged into each other with cables, so they can co-operate to create the network; the level of route redundancy needs to be sufficient to meet the uptime requirements of the computers and external connections.

And those gadgets need to know your requirements - each computer needs to be assigned to a subnet, or to a group of subnets if it's a trunk connection (some computers want to be on more than one subnet, so just giving them a trunk with multiple VLANs saves on cables); each subnet needs rules about which other subnets it can talk to, and how (including virtual IPs); each subnet needs a router IP that can be told to the computers so they know who to pass their inter-subnet traffic to; and external links need various details supplied. And if the gadgets know your uptime requirements, then they could know their own approximate chances of failure, and guess approximate failure statistics for the cables between them, and complain if there's not enough redundancy in the network, and suggest what you need to buy to meet your requirements.
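The redundancy check the gadgets could do is, at its simplest, a bit of probability. The sketch below assumes failures are independent (which real ones often aren't), and the availability figures for gadgets and cables are numbers I've made up purely for illustration:

```python
def series(*availabilities):
    """Availability of components that must ALL work (one path end to end)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """Availability of alternatives where ANY one working is enough."""
    all_fail = 1.0
    for a in availabilities:
        all_fail *= 1 - a
    return 1 - all_fail


def meets_target(path_availabilities, target):
    """True if the redundant paths together reach the uptime target."""
    return parallel(*path_availabilities) >= target


GADGET = 0.999  # hypothetical availability of one gadget
CABLE = 0.995   # hypothetical availability of one cable
path = series(GADGET, CABLE)

print(meets_target([path], 0.9999))        # one path is not enough
print(meets_target([path, path], 0.9999))  # two independent paths are
```

When the answer is no, the gadgets have what they need to complain, and to work out how many extra paths would fix it.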

And... that's all. All the stuff about routers and firewalls, and making them redundant, can really just be figured out by the gadgets between themselves; the gadgets will mainly be switches that establish trunks between themselves and provide the right subnets on the right ports, but the routing and firewalling can just happen within them. Rather than buying separate physical things to be routers and firewalls, you can just buy gadgets that are switches with enough power inside to do routing and firewalling tasks, and let them co-operate together to arrange it.

This not only makes it easier to design and maintain the network, it makes it more efficient and improves redundancy. If you have two gadgets, each carrying two subnets, then two computers on different subnets on the same gadget can communicate with each other, with that gadget doing the job of a router; two computers on different subnets on the other gadget can likewise communicate directly, as the other gadget is also a router. There's no need for all the traffic between those subnets to come together at a router (or even the active member of a redundant pair of routers); traffic only needs to go over the trunk between the two gadgets if it's really destined for a computer on the other gadget. We don't need complex router failover protocols to enable a pair of routers to work together any more, as every gadget will route between any subnets on it, subject to the rules. And when they are told to firewall between subnets, they will exchange the connection state information between themselves over their trunks.

How we'd implement this

For a start, the minimal hardware requirement for a gadget is that it's a computer with a bunch of network interfaces (mainly Ethernet of various kinds, or maybe leased lines). And it'll need a serial console port for configuration. But there's plenty of scope for connecting the external interfaces via a configurable switching matrix, and then putting the internal computer onto that switching matrix through an internal interface, or more for capacity. And there's also plenty of scope for building dedicated routing/firewalling hardware and connecting that to the switching matrix via internal interfaces, too, or building some level of routing/firewalling logic into the switching matrix itself.

This kind of hardware already exists; most high-end switches can already perform complex routing tasks, and high-end routers often contain small configurable switches. The convergence is already happening at the hardware level; it's all a matter of software.

Each gadget will need its local port map, specifying what each port is to do - for ports connecting to computers, which subnets to provide; for ports used as trunks between gadgets, perhaps a shared secret so that the traffic can be encrypted, as somebody tapping into that link could let themselves onto subnets they're not allowed to. And for ports connecting to external networks, the details of that network.

Given that, each gadget can easily configure its switching functionality, organising VLANs and deciding what ports to put them on, either directly or trunked together. All subnets can be trunked onto the links between gadgets, along with a special internal VLAN for gadgets to communicate with each other directly.
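To give a feel for how little the gadget needs to be told, here's a sketch of a port map and the VLAN layout derived from it. The field names, the subnet names and the format itself are all hypothetical; this is one possible shape, not a proposed standard:

```python
INTERNAL_VLAN = "gadget-internal"  # assumed name for the inter-gadget VLAN

# Hypothetical local port map for one four-port gadget.
PORT_MAP = {
    1: {"role": "computer", "subnets": ["office"]},
    2: {"role": "computer", "subnets": ["office", "servers"]},  # trunked host
    3: {"role": "trunk", "secret": "example-shared-secret"},
    4: {"role": "external", "details": {"peer": "upstream-isp"}},
}


def vlan_membership(port_map, all_subnets):
    """Work out which ports each VLAN must appear on."""
    members = {}
    for port, config in sorted(port_map.items()):
        if config["role"] == "computer":
            vlans = config["subnets"]
        elif config["role"] == "trunk":
            # Inter-gadget trunks carry every subnet plus the internal VLAN.
            vlans = list(all_subnets) + [INTERNAL_VLAN]
        else:
            continue  # external links are routed, not switched
        for vlan in vlans:
            members.setdefault(vlan, []).append(port)
    return members


layout = vlan_membership(PORT_MAP, ["office", "servers"])
for vlan, ports in layout.items():
    print(vlan, ports)
```

Treating external ports as routed rather than switched is my own simplification; the point is that the switching configuration falls out of the port map without anybody typing VLAN numbers into anything.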

Then each gadget will need a copy of a global policy file, which details the routing and firewalling rules between subnets, and the uptime requirements. This should be the same for every gadget in the network, so it might as well be automatically replicated between them; a new gadget that's plugged in will just get a copy from another gadget over the internal VLAN on the inter-gadget trunk, and a new configuration can be provided to any gadget in the network (via its serial port, or via a management subnet if you have an ssh authentication set up to sftp the new configuration file in), for it to spread out through the network.
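The replication could be as simple as flooding the newest copy to every neighbour. In the sketch below, the version number on the policy is my own assumption (there has to be some way to tell newer from older), and the class is a toy rather than a protocol design:

```python
class Gadget:
    """Toy gadget that floods the newest policy it has seen to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.version = 0    # assumed: policies carry an increasing version
        self.policy = None
        self.neighbours = []

    def connect(self, other):
        """Plug in a trunk; whichever side is behind catches up."""
        self.neighbours.append(other)
        other.neighbours.append(self)
        if self.version >= other.version:
            other.receive(self.version, self.policy)
        else:
            self.receive(other.version, other.policy)

    def receive(self, version, policy):
        """Accept a policy if it is newer, then pass it on."""
        if version <= self.version:
            return  # already up to date, so the flood stops here
        self.version, self.policy = version, policy
        for neighbour in self.neighbours:
            neighbour.receive(version, policy)


a, b, c = Gadget("a"), Gadget("b"), Gadget("c")
a.connect(b)
b.connect(c)

# Upload a new policy to any one gadget and it spreads to the rest.
a.receive(1, {"office": {"may_talk_to": ["servers"]}})
print([g.version for g in (a, b, c)])

# A gadget plugged in later picks the policy up from its neighbour.
d = Gadget("d")
d.connect(c)
print(d.policy)
```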

Given that, each gadget can then easily configure its routing and firewalling functionality, and establish connection state and BGP route sharing with other gadgets implementing any given firewall. And the gadgets can communicate with each other to establish the global connectivity, so that traffic that needs to be routed between gadgets can be sent along the correct path; rather than having separate failover at the switch (STP) and router/firewall (OSPF) layer, we can do it all in one, simplifying matters. When links or gadgets fail, a single mechanism will ensure that connectivity continues, so there's a lot less to go wrong and a lot less to have to think about in planning.

And when it comes time to upgrade, just put in more gadgets, tell them what their ports are connected to, and just plug them in. The existing global policy will flow into them, and they'll join the network; then you can amend the global policy to enable routing to the new subnets you've created, and the new computers you've plugged in will join the network.

There's no need for backbone subnets. Each gadget can route between any subnets it carries, and route traffic to gadgets that have subnets it doesn't carry. You just need subnets for computers, and subnets for external links. From the perspective of the users of the network, a single giant router connects all the subnets. But in reality, gadgets with BGP peering will be talking to each other and to their peers to establish a global routing table, while gadgets without BGP peering will be routing between themselves and to the nearest BGP gadget for external traffic, but the routing function is virtualised and distributed between them to be one virtual router.

Although the functionality of the gadgets is virtualised and distributed between them, the gadgets needn't be homogeneous. Gadgets with BGP peers should be ones with lots of RAM to store the big routing table. Gadgets with high levels of traffic through them should have more advanced levels of switching and routing hardware. One danger is that an extension to the network shifts traffic patterns, causing more traffic to go through previously lightly loaded gadgets which might have once only had to perform basic switching functions (perhaps only carrying a single subnet), but which now find themselves routing, firewalling, and maybe even implementing BGP - so it'll be wise to provide a tool that reports on what levels of different activities (switching, routing, basic firewalling, port/address translation, and BGP peering) each gadget is doing, and how it's doing for CPU time and RAM.

In effect, what I'm proposing is a slight change to how switches are configured, to make managing a large redundant network easier; then virtualising and distributing the routing/firewalling job.

It'd be nice to make the inter-gadget protocol a standard, so you can mix gadgets from different manufacturers. But that'd be a tricky thing to organise. As the market currently stands, a manufacturer that provides a network like this would have a nice commercial edge, and they'd just love to patent it and keep competitors out. So I predict that, to begin with, such setups would constrain you to a single vendor; but once they've all got their own offerings so it's no longer a competitive advantage, perhaps some standardisation efforts could occur.

Inter-gadget routing can use techniques learnt from Multi Protocol Label Switching. At the basic level, when an Ethernet frame is destined for a different gadget, the source gadget can just tag it with the ID of the target gadget and send it down the best trunk; subsequent gadgets that receive the frame can examine the target gadget field, looking deeper inside if the frame is destined for this gadget, otherwise just forwarding it on without any more complex inspection required. This means that gadgets with only other gadgets plugged into them (the 'backbone gadgets') need only ever perform this very simple forwarding function, meaning they can achieve high throughput with only a simple (and therefore cheap) hardware switching matrix.
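The per-frame work on a backbone gadget then reduces to one comparison and one table lookup. A sketch, with a made-up three-gadget line (A - B - C) and a table of which neighbour is the best way towards each target; the frame layout and all the names are invented for illustration:

```python
def handle_frame(gadget_id, frame, best_trunk):
    """Decide what one gadget does with a tagged frame.

    Only the target gadget looks inside; everyone else just forwards.
    """
    if frame["target"] == gadget_id:
        return ("inspect", frame["payload"])
    return ("forward", best_trunk[frame["target"]])


# For each gadget: target gadget -> neighbour on the best trunk towards it.
NEXT_GADGET = {
    "A": {"B": "B", "C": "B"},
    "B": {"A": "A", "C": "C"},
    "C": {"A": "B", "B": "B"},
}


def deliver(source, frame):
    """Follow a frame hop by hop until the target gadget inspects it."""
    here = source
    path = [here]
    while True:
        action, value = handle_frame(here, frame, NEXT_GADGET[here])
        if action == "inspect":
            return path, value
        here = value
        path.append(here)


print(deliver("A", {"target": "C", "payload": "hello"}))
```

Gadget B never looks at the payload, which is why a gadget that only ever sits in the middle can get away with cheap hardware.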

Still, a network hardware company might not want to implement something like my gadgets. They do too well out of selling expensive separate switches, routers, and firewalls, and having a broad, complex product line to snare the unwary with - with price differences between products often based more on different licence keys that enable different bits of functionality than on anything in the product; but the gadgets approach relies on being able to distribute functionality dynamically through the network. There are three approaches to this issue. You can buy licences for a network rather than for a device (so you'd license, say, the ability to do firewalling across the whole network), with the licence key replicated across the network along with the configuration. Or you can buy licences for devices, and have the gadgets take into account the capabilities of different gadgets to allocate functionality; perhaps a gadget without a routing licence can just act like a switch, passing all traffic for transport between subnets to another gadget... but we're trying to move away from that.

Or we can just hope a small startup goes ahead and does it, kicking the licence-extortion business model out from under the industry.

6 Comments

  • By Jamesb, Thu 10th Sep 2009 @ 2:13 pm

    Al! You're totally linked to from Hacker News! Nice one. Also, my eMail web interface is still getting worse :( Any ideas yet? SSH problematic from my workplace.

  • By David Cantrell, Thu 10th Sep 2009 @ 8:45 pm

    Reliability is normally measured against scheduled uptime, not just against time. So a three hour planned outage for maintenance that your users have been notified of in advance is Just Fine and won't prevent you hitting your 99.999% target.

  • By Faré, Sat 12th Sep 2009 @ 6:55 am

    OK. So we need to find the hardware guys and we're ready to launch the startup...

  • By Faré, Sat 12th Sep 2009 @ 7:11 am

    Or rather, start by modifying the NetBSD and Linux kernels to allow any old multiport device to function like a "gadget". Then bring in the hardware hackers to make cheap gadget devices with some off-the-shelf component, ASIC, FPGA or something in between.

  • By alaric, Thu 17th Sep 2009 @ 8:45 am

    Yeah, doing it in an open-source UNIX would be a good start! The bridging-with-fast-path-routing would need a bit of light kernel hackery (as the effective per-subnet router IP would need to be present, with the same MAC, on every gadget carrying that subnet, so we'd need to be able to do that - without ARP battles occurring).

    The rest can be in userland.

  • By Violet, Mon 28th Sep 2009 @ 10:55 am

    I find it very interesting to compare-and-contrast this idea with the state of routing in real-time video. There, you basically always have one big router, as big as you can afford, and it is the heart of your entire video/audio infrastructure, and is horrendously complex to setup. Perhaps this reflects the fact that long-distance real-time video (at broadcast levels of quality) is still so expensive that you only have a very few of these external links, and organise your schedule around their availability, so most of the complex routing is local.

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales