Wednesday, June 25, 2008

Appropriate Redundancy

I think we can all agree: more uptime is better. The path to that goal is what separates us into different camps: the men from the boys, or perhaps more accurately, the paranoid from those who have never had a loaded 20-amp circuit plugged into another loaded 20-amp circuit (a true story from the colocation facility we're at; the person responsible no longer works there).

When it comes to planning for redundancy, there's always a point where things get impractical, and that line is probably the budget. Most companies I've worked for are willing to put some money into redundancy as long as it's not "a lot" (whatever that is), and even to get them that far, you're probably going to have to work for it. The logic of buying more than one of the same thing is sometimes lost on people who balk at purchasing such frivolities as ergonomic keyboards.

Assuming you can get your company behind your ideas, here's how you might plan for redundancy:


  1. Power
    If you can at all avoid it, don't buy servers without dual power supplies. Almost every server you can get today has this feature, and there's a good reason for it.

    Dual power supplies are hot-swappable, which means when one dies, you can replace it without stopping the machine. If you're not in a physical location with redundant power, you can also hook each of the power supplies into a different battery backup (UPS).

    Many servers come with a 'Y' cord to save you from using extra outlets. It looks cool, but ignore it. If you plug both power supplies from the same server into the same battery and that battery dies during a power event, so does your server. Each cord goes to its own battery. That allows the batteries to share the load, and it buys you some insurance if one of them bites it.
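
    If your servers have a BMC, you can keep an eye on the power supplies remotely instead of waiting for an amber light. Here's a minimal sketch using ipmitool (assuming the ipmitool package is installed and IPMI is set up on the host; the remote hostname and credentials below are placeholders):

      # List the power supply sensors the BMC reports; a dead or
      # unplugged supply shows up here long before anyone walks the rack.
      ipmitool sdr type "Power Supply"

      # The same query against a remote BMC over the network:
      ipmitool -I lanplus -H bmc.example.com -U admin -P secret sdr type "Power Supply"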

  2. Networking
    Again, almost all servers these days come with dual on-board network cards. I used to wonder why, since I've only rarely put the same server on two networks, and setting up DNS for dual-homed computers is a pain in the butt.

    After much research, I learned about bonded NICs. When you bond NICs, your two physical interfaces (eth0 and eth1) become slaves to a virtual interface (typically bond0). Depending on the bonding mode, you can get redundancy AND load balancing, essentially giving you 2 Gb/s instead of 1.
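
    As a minimal sketch, here's roughly what an active-backup bond looks like on a RHEL-style box (file locations and addresses are illustrative; check your distribution's docs for the specifics):

      # /etc/modprobe.conf -- load the bonding driver for bond0.
      # mode=active-backup is failover only; other modes (balance-rr,
      # or 802.3ad with switch support) add load balancing.
      alias bond0 bonding
      options bonding mode=active-backup miimon=100

      # /etc/sysconfig/network-scripts/ifcfg-bond0
      DEVICE=bond0
      IPADDR=192.168.1.10
      NETMASK=255.255.255.0
      ONBOOT=yes
      BOOTPROTO=none

      # /etc/sysconfig/network-scripts/ifcfg-eth0
      # (ifcfg-eth1 is identical, with DEVICE=eth1)
      DEVICE=eth0
      MASTER=bond0
      SLAVE=yes
      ONBOOT=yes
      BOOTPROTO=none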

    Two NICs acting like one is only halfway ideal, though. In the event of a switch outage (or more likely, a switch power loss), your host is still essentially dead in the water. To remedy this, use two switches for twice the reliability. For my blades, I've got all of the eth0 cables going to one switch, and all of the eth1 cables going to another. I also use color-coded cables so I can easily figure out what goes where. Here's a pic:

    [photo: color-coded eth0 and eth1 cables running from the blade enclosure to two separate switches]

    As before, make sure that your switches are plugged into different UPSes, if possible.
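
    Once the bond is up, the kernel will tell you which slave is active and whether either link has dropped. Pull a cable and watch it fail over:

      # Shows the bonding mode, the currently active slave, and the
      # link status of each NIC.
      cat /proc/net/bonding/bond0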

  3. Redundant Servers
    Despite your best efforts, a server will inevitably find a reason to go down. Whether it's a reboot for patching, an electrical issue, a network issue, or some combination of the above, you're going to have to deal with the fact that the server you're configuring may not always be available.

    To combat this affront to uptime, we have the old-school option of throwing more hardware at it. To quote Hadden from the film Contact: "Why build one when you can have two at twice the price?"

    Getting two servers to work together hinges on two things: which service you're trying to back up, and where you're getting your data. Simple things, such as web sites, can be as easy as an rsync of webroots, or maybe remotely mounting the files from a fileserver over NFS. Other things, such as NFS itself, require expensive Storage Area Networks (SANs) or high-bandwidth block-level filesystem replication across the network. Either way, it's something you should research heavily, and it's outside the scope of this post.
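
    For the simple webroot case, a minimal sketch is a cron-driven rsync from the primary to the standby (the hostname and paths here are made up, and you'd want SSH keys in place so it runs unattended):

      # Push the webroot to the standby box. -a preserves permissions
      # and timestamps, -z compresses over the wire, and --delete
      # removes files on the standby that vanished from the primary.
      rsync -az --delete /var/www/ standby.example.com:/var/www/

      # From the primary's crontab, every 15 minutes:
      */15 * * * * rsync -az --delete /var/www/ standby.example.com:/var/www/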


Hopefully you can use this information to give you some ideas on redundancy in your own infrastructure. It's a rare sysadmin who can't afford to spend some time considering what would make their network more fault tolerant.