Tuesday, November 11, 2008

Where to put your system monitoring

I'm getting ready to implement my new Nagios monitoring system, and I've been researching best practices.

My current setup has 3 “data sites”, which I consider to be physical locations where servers are kept: the primary site, the backup site, and the soon-to-be-primary site. When the new site becomes primary, the current primary will become backup, and the backup site will go away. Here's how they're set up:



They are geographically diverse, and as you can see, there is limited bandwidth between them.

Nagios is currently set up at the Backup site, and has remained unchanged for the most part since the backup site was the primary (and only) data site. This is not ideal, for a bunch of reasons.

Because of the way Nagios queries things, it is at the mercy of the networking devices between it and the target. If a router in between goes down, Nagios sees everything beyond that router as down. You can alleviate the most annoying side effect (dozens or hundreds of alerts) by making the hosts beyond the router "children" of the router, in which case Nagios will only notify you that the parent is unavailable.
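To make the parent/child idea concrete, here's a rough Python sketch of the reachability logic as I understand it. In Nagios itself you set this up with the `parents` directive in each host definition; the code below is only a simplified model, not Nagios's actual implementation. The host names (`backup-router`, `primary-router`, `mail`, `web`) and the function names are made up for illustration, and the sketch assumes the parent relationships contain no loops.

```python
from enum import Enum


class State(Enum):
    UP = "UP"
    DOWN = "DOWN"
    UNREACHABLE = "UNREACHABLE"


def classify(parents, responding):
    """Classify every host as UP, DOWN, or UNREACHABLE.

    parents:    dict mapping host -> list of parent hosts (empty list for
                hosts on the same segment as the monitoring server).
    responding: set of hosts that currently answer their checks.

    Simplified rule: a host that does not respond is DOWN if it has no
    parents or at least one parent is UP; otherwise it is UNREACHABLE.
    Assumes the parent graph is acyclic.
    """
    states = {}

    def state_of(host):
        if host in states:
            return states[host]
        if host in responding:
            result = State.UP
        else:
            host_parents = parents.get(host, [])
            if not host_parents or any(
                state_of(p) is State.UP for p in host_parents
            ):
                result = State.DOWN
            else:
                result = State.UNREACHABLE
        states[host] = result
        return result

    for host in parents:
        state_of(host)
    return states


def hosts_to_alert_on(states):
    """Only hosts that are really DOWN deserve a page; skip UNREACHABLE ones."""
    return sorted(h for h, s in states.items() if s is State.DOWN)


# Hypothetical topology: the monitor sits next to backup-router, and
# everything at the primary site hangs off primary-router.
EXAMPLE_PARENTS = {
    "backup-router": [],
    "primary-router": ["backup-router"],
    "mail": ["primary-router"],
    "web": ["primary-router"],
}

if __name__ == "__main__":
    # The inter-site router dies, so nothing behind it answers either.
    example_states = classify(EXAMPLE_PARENTS, {"backup-router"})
    print(hosts_to_alert_on(example_states))  # prints ['primary-router']
```

With the parent relationships in place, the one dead router is the only thing that generates an alert; `mail` and `web` are marked unreachable instead of down.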

Aside from losing status checking on entire segments of our network in the event of an outage, what if the segment with no network access hosts your mail server? I've had this happen before, and it's disturbing to suddenly receive 2 hours' worth of 'down' notifications at 3am. Not a good thing.

To circumvent this type of behavior, I'm going to be deploying one Nagios instance at each location:


In the event that one of my sites loses network access, I've still got another host to send messages.

If you monitor, how do you guys arrange your monitoring? If you don't, any plans to start?

3 comments:

janke ' or 1=1 -- said...

What we settled on was having the monitoring system at the primary site monitor everything.

The monitoring system at the backup site only monitors the monitoring system at the primary site, the routers and firewalls at the primary site, and a few key systems at the primary site, but not everything.

Similar to your setup.

c!w said...

Put a GSM modem into your Nagios server to send notifications in case your network/mail server is down or unreachable!

Ernie Oporto said...

If you have Nagios at both locations, then you won't miss out on trending when the link to the remote site comes back up, which could be a real help in figuring out why the link went down in the first place.