Tuesday, April 21, 2009

My trouble with bonded interfaces

In an effort to improve the redundancy of our network, I have all of our blade servers configured with bonded network interfaces. Bonding the interfaces in Linux means that eth0 and eth1 form together like Voltron into bond0, an interface that can be "high availability", meaning that if one physical port (or the device it is plugged into) dies, the other can take over.
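When a bond misbehaves, a first sanity check is to see what the bonding driver itself reports for each slave. The Linux driver exposes this as text under /proc/net/bonding/. Here is a minimal Python sketch that pulls the mode and per-slave link status out of that text. The function name and the sample text are mine and purely illustrative (the sample is not output captured from any of my hosts), and the parser assumes only the "Bonding Mode", "Slave Interface" and "MII Status" fields.

```python
def parse_bonding_status(text):
    """Return (mode, {slave_name: mii_status}) from /proc/net/bonding/<bond> text."""
    mode = None
    slaves = {}
    current = None  # slave whose section we are currently reading
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without "key: value"
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            mode = value
        elif key == "Slave Interface":
            current = value
            slaves[current] = None
        elif key == "MII Status" and current is not None:
            # The bond-level "MII Status" appears before any slave and is skipped.
            slaves[current] = value
    return mode, slaves


# Illustrative sample only; an assumption about the general shape of the file.
SAMPLE = """\
Bonding Mode: load balancing (round-robin)
MII Status: up

Slave Interface: eth0
MII Status: up

Slave Interface: eth1
MII Status: down
"""

mode, slaves = parse_bonding_status(SAMPLE)
print(mode, slaves)
```

On a real host you would feed it the contents of /proc/net/bonding/bond0 instead of SAMPLE. Keep in mind that a slave reporting "up" only tells you the link is electrically alive, not that traffic sent through it actually gets anywhere.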

Because I wanted to eliminate a single point of failure, I used two switches.



The switches are tied together to make sure traffic on one switch hits the other if necessary.

Here is my problem, though: I have had an array of interesting traffic patterns from my hosts. Sometimes they have intermittent loss of connectivity, sometimes they have regular periods of no connectivity (both of which I've solved by changing the bonding method), and most recently I've had the very irritating problem of a host that connects perfectly fine to anything on the local subnet but suffers heavy packet loss on remote traffic. To fix the problem, all I have to do is unplug one of the network cables.

I've got the machine set up in bonding mode 0. According to the documentation, mode 0 is:


Round-robin policy: Transmit packets in sequential
order from the first available slave through the
last. This mode provides load balancing and fault
tolerance.



It would be at least logical if I lost 50% of the packets. Two interfaces, one malfunctioning, half the packets. But no, it's more like 70% of the packets getting lost, and I haven't managed to figure it out yet.
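To make that 50% expectation concrete, here is a toy Python sketch of the round-robin policy quoted above. It is a deliberately simplified model, not how the kernel driver is implemented: it assumes frames are handed to the slaves strictly in turn, that every frame sent down a "dead" path is lost, and that nothing is retransmitted. The function name is made up for this example.

```python
from itertools import cycle


def round_robin_loss(slaves, dead, packets):
    """Fraction of frames lost when sending round-robin over `slaves`.

    Assumption: any frame handed to a slave listed in `dead` never arrives.
    """
    order = cycle(slaves)  # eth0, eth1, eth0, eth1, ...
    lost = sum(1 for _ in range(packets) if next(order) in dead)
    return lost / packets


loss = round_robin_loss(["eth0", "eth1"], {"eth1"}, 1000)
print(f"expected loss with one dead path: {loss:.0%}")
```

Under those assumptions, two slaves with one bad path can never lose more than half the frames, which is exactly why the roughly 70% I'm seeing doesn't fit the simple "one bad link" explanation.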

If you check my Twitter feed for yesterday, I was whining about forgetting a jacket. This is because I was hanging out in the colocation running tests. 'tcpdump' shows that the packets are actually being sent. Only occasional responses are received, though, unless the other host is local, in which case everything is fine.

There are several hosts configured identically to this one, but this is the only one displaying the issue. Normally I'd suspect the firewall, but there isn't anything in the configuration that would single out this machine, and the ARP tables check out everywhere. I'm confused, but I haven't given up yet. I'll let you know if I figure it out, and in the meantime, if you've got suggestions, I'm open to them.