Tuesday, November 11, 2008

Where to put your system monitoring

I'm getting ready to implement my new Nagios monitoring system, and I've been researching best practices.

My current setup is that I have 3 “data sites”, which I consider to be physical locations where servers are kept. The primary site, the backup site, and the soon-to-be-primary site. When the new site becomes primary, the current primary will become backup, and the backup site will go away. Here's how they're setup:



They are geographically diverse. and as you can see, there is limited bandwidth between them.

Nagios is currently set up at the Backup site, and has remained unchanged for the most part since the backup site was the primary (and only) data site. This is not ideal, for a bunch of reasons.

Because of the way Nagios queries things, it is at the mercy of the networking devices between it and the target. If the router in-between goes down, then Nagios sees everything beyond that router as down. You can alleviate the most annoying side effect (dozens or hundreds of alerts) by assigning things beyond the router to be "children" of the router, in which case Nagios will only let you know that the parent is unavailable.

Aside from not having status checking on entire segments of our network in the event of an outage, what if the segment with no network access hosts your mail server? I've had this happen before, and it's disturbing to suddenly receive 2 hours worth of 'down' notifications at 3am. Not a good thing.

To circumvent this type of behavior, I'm going to be employing one nagios at each location:


In the event that one of my sites loses network access, I've still got another host to send messages.

If you monitor, how do you guys arrange your monitoring? If you don't, any plans to start?