Tuesday, March 10, 2009

Pre-emptive Troubleshooting

Troubleshooting is a very reactive process. By its nature, you're fixing an already existing problem. As good as it is to be able to troubleshoot, it's better to prevent weird problems from cropping up in the first place.

As sysadmins, we have numerous ways to do this. The first, as Michael Jenke very sensibly suggests, is structured systems management: administering via script instead of editing files by hand or, worse yet, clicking through an interface.
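To make "administering via script" a little more concrete, here's a minimal sketch in Python. It isn't taken from Michael's post; the function name `ensure_line`, the file name `example.conf`, and the setting are all made up for illustration. The idea is that the script states the result you want (this line exists in this file), so running it twice is harmless, which a hand edit can't promise.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` appears in the file at `path`.

    Returns True if the file had to be changed, False if the line
    was already there. (Hypothetical helper, for illustration only.)
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already present, so leave the file alone
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


if __name__ == "__main__":
    # Work in a throwaway directory so the demo touches nothing real.
    with tempfile.TemporaryDirectory() as tmp:
        conf = Path(tmp) / "example.conf"  # hypothetical config file
        print(ensure_line(conf, "max_connections = 100"))  # first run edits the file
        print(ensure_line(conf, "max_connections = 100"))  # second run is a no-op
```

A real configuration management tool does far more than this, but the property worth copying is the same: the change is written down, repeatable, and safe to re-run.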

Another very potent tool, which Chris Siebenmann brings up today, is using checklists to perform complex tasks. (Incidentally, Chris mentions a term I'd never heard before: "rubber duck debugging". I think I'm going to try to expense one of these for troubleshooting purposes.)

I'm a firm believer in checklists for anything that isn't (or can't be) automated. On our internal wiki, I have checklists for things like adding and removing users from the infrastructure, adding new machines, and so on. They're great: I don't have to "remember" everything that needs to be done; I can just work through the list, and it's accurate every time. And if I find I've left something off the list, I add it, and the list gets more accurate.
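A wiki page is plenty for this, but if you'd rather keep a checklist next to your scripts, here's a rough sketch of one as a small Python structure. The class name `Checklist` and the steps in the demo are invented for illustration; they are not the actual lists from our wiki.

```python
from dataclasses import dataclass, field


@dataclass
class Checklist:
    """A named list of steps plus a record of which ones are done."""
    name: str
    steps: list[str]
    done: set[int] = field(default_factory=set)

    def check(self, index: int) -> None:
        """Mark the step at `index` (0-based) as completed."""
        if not 0 <= index < len(self.steps):
            raise IndexError(f"no step {index} in checklist {self.name!r}")
        self.done.add(index)

    def remaining(self) -> list[str]:
        """Return the steps that have not been checked off, in order."""
        return [s for i, s in enumerate(self.steps) if i not in self.done]

    def is_complete(self) -> bool:
        return len(self.done) == len(self.steps)

    def render(self) -> str:
        """Render the checklist as text, with [x] for finished steps."""
        lines = [self.name]
        for i, step in enumerate(self.steps):
            mark = "x" if i in self.done else " "
            lines.append(f"[{mark}] {step}")
        return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical steps; a real list would come from your own procedure.
    new_user = Checklist(
        "Add a user",
        ["Create the account", "Add to groups", "Set up mail", "Update the wiki"],
    )
    new_user.check(0)
    new_user.check(1)
    print(new_user.render())
    print("Still to do:", new_user.remaining())
```

Adding a forgotten step is just one more string in the list, which is the same "add it and it's more accurate" habit in a different place.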

I wasn't always such a checklist person. It took a while for their usefulness to sink in. Tom Limoncelli does a great job of explaining why in his blog post, Transforming an art into a science: back in the bad old days, planes kept crashing until pilots started using pre-takeoff checklists. Similarly, Boston.com has an article about doctors using checklists, which resulted in 36% fewer complications and deaths in the operating room.

Checklists take complex, fun tasks and make them boring.