An anecdote from my distant past: The Tale of the Half-Working Network (by alaric)
Ok, this is an interesting tale I've retold in person a few times. So I've decided to write it up for the world to see.
The scene: My student flat, 1998. Seven of us are living there, and we're all nerds so there's lots of computers. The world of networked multi-player games is exploding, and we want to play.
So, we buy some network cards, and go to Maplin for some coax cable, BNC connectors and terminators, and start to set up a 10BASE2 LAN.
But one of the computers can't talk to the others. What's going on?
The Symptoms
The Windows drivers detected the network card just fine. The IP address and netmask were set correctly. The machine could ping its own IP, but couldn't ping anybody else, and vice versa.
All the other machines could ping each other, no problem.
When the faulty machine sent packets, the network activity lights on all the other network cards flickered. It was certainly sending packets out, they just didn't elicit a response.
So I ran tcpdump on my NetBSD laptop and... when the faulty machine sent IPv4 pings, what arrived wasn't an IPv4 packet. They had an EtherType of 0x08FF.
So, must be a faulty network card, right?
Looking deeper
So, a valid IPv4 packet, when encoded into an Ethernet frame, has an EtherType of 0x0800 to mark it as an IPv4 packet, and the packet itself starts with a byte whose top nibble is the IP version - 4 - and whose bottom nibble is the IP header length - usually 5 for all non-exotic packets. So the first byte is usually 0x45 in hex.
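To make that layout concrete, here's a minimal Python sketch that pulls the EtherType and the version/header-length byte out of a raw Ethernet II frame. The function name `describe_frame` and the sample frame (including its MAC addresses) are made up for illustration; this isn't anything that was run at the time.

```python
import struct

ETHERTYPE_IPV4 = 0x0800

def describe_frame(frame: bytes) -> dict:
    """Decode the EtherType and first IP header byte of an Ethernet II frame."""
    # Bytes 0-5: destination MAC, bytes 6-11: source MAC,
    # bytes 12-13: EtherType (big-endian), byte 14 onwards: payload.
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    first = frame[14]
    return {
        "ethertype": ethertype,
        "is_ipv4": ethertype == ETHERTYPE_IPV4,
        "version": first >> 4,        # top nibble: IP version
        "ihl_words": first & 0x0F,    # bottom nibble: header length in 32-bit words
    }

# Hypothetical frame: invented MAC addresses, EtherType 0x0800, first IP byte 0x45.
sample = bytes.fromhex("ffffffffffff" "020000000001" "0800" "45")
info = describe_frame(sample)
print(info)
```

For a healthy frame this reports an EtherType of 0x0800, version 4 and a header length of 5 words.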
What I was seeing on the wire had packets that start 0x45 - but an EtherType of 0x08FF.
And what's more, every other byte of the packet was 0xFF. So all the packets started 0x45 0xFF (something) 0xFF (something) 0xFF...
Definitely a faulty network card, right?! So I went to swap it with one from another computer to see where the problem went.
Looking even deeper
When I opened my flatmate's PC to swap the network card, I noticed something interesting. The network card, as were most at the time, was a 16-bit ISA card. The 16-bit ISA socket was a backwards compatible extension to the original 8-bit ISA standard, implemented by having an 8-bit ISA socket, then adding an extra socket next to it for the extra 8 bus lines and a few extra IRQ and control lines. So old 8-bit cards could plug straight into the 8-bit socket, but 16-bit cards would plug into the 8-bit socket AND the extension socket.
Jim's network card was at a slight angle; the card hadn't gone in quite straight, and had caught on the end of the extension connector in a way that caused the motherboard to flex away from it as the card was pushed into place, so the extension connector wasn't fully seated.
All the basic communication between the software driver and the card, to confirm the card was there and to tell it what to do, just involved 8-bit transfers - so they worked fine. It's only when the bulk data for the actual Ethernet frames was transferred that the extra 8 bus lines in the extension connector were used, and as they were unconnected, they floated high, meaning that the top 8 bits of every 16-bit chunk transferred were... 0xFF.
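The effect is easy to simulate. The sketch below is an illustration only: it assumes the frame is copied to the card in little-endian 16-bit words starting at an even offset, so the "top 8 bits" of each word are the bytes at odd offsets. The function name `float_high_bytes` and the sample bytes are invented for the example.

```python
def float_high_bytes(data: bytes) -> bytes:
    """Simulate 16-bit transfers whose upper 8 data lines read as all ones.

    Assumes little-endian words: the byte at each odd offset is the high
    byte of its 16-bit word, so it arrives as 0xFF.
    """
    out = bytearray(data)
    for i in range(1, len(out), 2):
        out[i] = 0xFF
    return bytes(out)

# Hypothetical slice of a frame starting at the EtherType field (offset 12,
# which is even): EtherType 0x0800, then the first four bytes of an IP header.
good = bytes([0x08, 0x00, 0x45, 0x00, 0x00, 0x54])
bad = float_high_bytes(good)
print(bad.hex())  # 08ff45ff00ff
```

Under those assumptions, the EtherType 0x0800 turns into 0x08FF and the payload becomes 0x45 0xFF (something) 0xFF..., which matches what tcpdump showed.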
I pulled the card, and pushed it back in carefully, and turned the computer back on... and everything was fine.
Conclusion
This was weird, because you'd expect that if something isn't plugged in properly, it would just completely fail to work at all. Computer hardware is usually a bit all-or-nothing when it comes to that kind of failure.
But the fact that the 16-bit ISA bus was a backwards-compatible extension of the 8-bit bus, providing a faster path for bulk data transfers but not being needed for small "control" commands, enabled this weird partial failure mode. Which, because it flew against my expectations, sent me off barking up the wrong tree entirely! If I hadn't noticed the slight misalignment of the connector when I went to swap the card, I'd probably have done the swap, found everything working Just Fine, swapped back, found everything still working Just Fine, and remained forever mystified as to what went wrong - and probably suspected there was some subtle hardware gremlin lurking that I would have, erroneously, attributed any later problems to as well, leading me further up wrong trees...
See also: https://cve.circl.lu/cve/CVE-2022-38392