Monday, June 16, 2008

Infrastructure upgrades through forest fires

It's funny, sometimes, how we tolerate suboptimal or downright counterproductive arrangements in our infrastructure, just because it's inconvenient or inopportune to do things the "right way". It seems like "the right way" either never comes, because projects get phased out, or it only gets fixed during a cataclysmic upheaval, once it has become an immediate concern.

The case in point is my mail server. We have an A and a B MX record. Originally the B MX just stored mail until A came back up, then delivered it. Everyone checks mail on A, so it can't really be down during the day, and about six months ago the office that hosted B relocated and B was never set back up. That left us with just A. To make matters worse, A was old enough that it was physically located in our backup site, which used to be our primary site. This was suboptimal. Of course there was talk about moving it to the primary site, but when could a maintenance window be created? And we'd risk the entire period of non-connectivity while it was being moved. No, management said, let's just leave it where it was.
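For anyone unfamiliar with the arrangement, this is roughly what a primary/backup MX pair looks like in a DNS zone file. The domain and hostnames here are made up for illustration, not ours; the lower preference number wins, and the higher-numbered host only sees mail when the primary is unreachable.

    ; hypothetical zone entries -- a.example.com is the primary MX,
    ; b.example.com is the backup that queues mail while A is down
    example.com.    IN  MX  10  a.example.com.
    example.com.    IN  MX  20  b.example.com.

With B gone, every sending server just keeps retrying A until its own queue lifetime runs out, which is exactly why A going dark is a problem.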

Great strategy. It actually worked fine though, until this weekend.

I came in on Saturday to do some major work on the blade systems I'm building for our new site. I sat down at my desk, ready to dive in. Since I was alone, Raiders of the Lost Ark was playing on the laptop. I had just logged into the first server when the lights went out, and the telltale screech and whine from the server room told me we'd lost main power.

In Granville, OH, that's not a strange thing. We've got backup AC and a backup generator, so I wasn't worried. The generator does have to be started manually, though, so I jogged into the server room and flipped on the CFL floor lamp. At least I tried to. One look at the generator control panel confirmed my fears: no generator power.

I tried for several minutes to start it, but nothing gave me the impression that anything would change, so I called my boss to let him know the situation and that I was going to start shutting down machines. Since the only critical thing was mail, I suggested that he change DNS to point to an as-yet unassigned IP at the colocation, and that I could set up a Postfix instance there to queue the mail. He said it would work, but he suggested an alternative approach.
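For what it's worth, that queue-and-forward fallback is only a few lines of Postfix configuration. This is a rough sketch of what I had in mind, not anything we actually deployed, and the domain and hostname are placeholders: the stand-in box accepts mail for our domain, holds it in its queue while the real server is dark, and hands it off once the server is reachable again.

    # main.cf on the stand-in box at the colocation -- hypothetical values
    relay_domains = example.com
    smtpd_recipient_restrictions = permit_mynetworks, reject_unauth_destination
    # keep queued mail around longer than the default five days
    maximal_queue_lifetime = 10d
    # hand everything off to the real server once it's back, rather than
    # following the MX record (which would point at this box for now)
    transport_maps = hash:/etc/postfix/transport

    # /etc/postfix/transport
    example.com    smtp:[mail.example.com]

While the real server is down, delivery just defers and the mail sits in the queue; when it comes back, the queue drains to it.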

Why not relocate the physical mail server to the colocation? A lightbulb went off. Not only could I take care of that long-standing problem, but because there was no power at all in the datacenter, the usual policy of no-downtime-for-repairs-and-upgrades was out the window.

The next morning, I left work for home at 5am. The previous 15 hours had been spent completely overhauling the backup datacenter. With mail relocated to the primary facility, once power came back on at the backup site I had free rein to cull everything unnecessary that had been accumulating.

There is now a pile of power, Ethernet, and copper/fiber cables covering a square yard or so, about six inches deep. I pulled out something like 96 ports' worth of switches, multiple servers, KVMs, fiber switches, and general cruft. The servers are also arranged so that no half-depth servers are hiding between full-depth ones. That was always a pet peeve of mine.

I thought about it while I was doing this: if fighting normal issues is considered firefighting, then what I went through should be considered forest-fire fighting. And just like a forest fire, good can come from it. It takes the massive heat of a forest fire to crack open some pine cones, and it takes massive infrastructure downtime to make significant changes.