I was talking to a friend of mine yesterday. He's a junior admin in a Windows shop (not that it makes any difference), and we were discussing the age of his servers, their reliability, and what he was doing about it. I asked the million-dollar question: "What would you do if one of the servers died right now?" The answer was chilling, more so to him than to me: "I have no idea."
After digging deeper, I learned that, while he did have general ideas for some services, there was no plan laid out, and what's more, there wasn't even a list of servers anywhere.
We immediately adjourned to Google Docs, where I quickly laid out a spreadsheet with some common fields, and he filled it in. He was surprised. "Wow! We really do have more servers than people."
Maintaining a list of servers is only the beginning, but it's an important foundation for every other part of system and infrastructure management. Until you have your server list, you can't implement host and service checking. You can't really develop a disaster recovery plan until you know what your assets (and liabilities) are.
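To make that concrete, here is a minimal sketch of how a server list could feed the next steps. It assumes the spreadsheet has been exported as CSV; the column names (hostname, ip, role, os, location) and the sample rows are made up for illustration and are not the fields my friend and I actually used.

```python
import csv
import io

# Hypothetical columns; adapt them to whatever your environment needs to track.
FIELDS = ["hostname", "ip", "role", "os", "location"]

# Made-up sample data standing in for a CSV export of the spreadsheet.
SAMPLE = """hostname,ip,role,os,location
dc01,10.0.0.10,domain controller,Windows Server,rack 1
fs01,10.0.0.20,file server,Windows Server,rack 1
web01,10.0.0.30,web server,Windows Server,
"""


def load_inventory(text):
    """Parse CSV text into a list of dicts, one per server."""
    return list(csv.DictReader(io.StringIO(text)))


def incomplete_rows(rows):
    """Return hostnames whose row has at least one blank field."""
    return [
        row["hostname"]
        for row in rows
        if any(not (row.get(field) or "").strip() for field in FIELDS)
    ]


def check_targets(rows):
    """Return (hostname, ip) pairs that a host-checking tool could be fed."""
    return [(row["hostname"], row["ip"]) for row in rows]


if __name__ == "__main__":
    servers = load_inventory(SAMPLE)
    print("Servers on record:", len(servers))
    print("Rows needing attention:", incomplete_rows(servers))
    for hostname, ip in check_targets(servers):
        print("would check", hostname, "at", ip)
```

The point is less the code than the dependency: once the list exists in a structured form, gaps in it become visible, and both monitoring and recovery planning have something to work from.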
These are important steps toward taking your systems to the next level. To really increase availability, you've got to know where you stand.