Switching on-call in Nagios

First off, Nagios is great. It isn't everything, but it works very well for almost all monitoring needs, and if you want reporting as well, there are options to make that work too. It's trading on-call that isn't easy.

One thing Nagios does not handle well, however, is switching on-call. Sure, you can set up lengthy schedules and such, but if you are in an environment where people trade shifts and want to switch easily, Nagios makes you go through the old cycle: edit the configuration, run nagios -v nagios.cfg, then run service nagios reload.

So what is a sysadmin or DBA to do? Well, here is how I worked around the issue.

My current employer has a ticketing system in which the duty of monitoring incoming tickets is traded along with who gets the alarms from Nagios. To take over the ticket system monitoring duties, you pull up a web page, select your name (or someone else's) from a list, and click the OK button. Voila!

Sure would be nice if this would work in Nagios as well.

Since I despise tedium, I decided to just make Nagios work in concert with the ticketing system.

How does it work?

Well, basically what happens is that a Nagios check polls the ticketing system for the current on-call "victim" and compares it to the contents of the on-call file. Nagios itself will always send alarms to the same on-call contact, but the email and pager for that contact are updated as needed via a Nagios event handler. Yep, might as well make Nagios do the heavy lifting.

What was needed?

The first thing I did was to set up a contact file for each of the people participating in on-call. They looked something like this:

define contact{
        contact_name                    mj
        alias                           Matthias Johnson
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r
        host_notification_options       d,r
        service_notification_commands   notify-by-email
        host_notification_commands      host-notify-by-email
        email                           mj@company.com
        pager                           123456789@serviceprovider.com
        }

Each of those files was saved under a name taken from the contact_name in the file.
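To give an idea of how a script can get at the email and pager values in one of these files, here is a minimal sketch in Python. It is not the code we actually ran; the function name parse_contact is made up for illustration, and it only handles the simple one-contact-per-file layout shown above.

```python
import re


def parse_contact(text):
    """Return the directives of the first 'define contact{...}' block as a dict.

    Hypothetical helper: assumes one contact per file, laid out like the
    example above (one 'name value' pair per line).
    """
    match = re.search(r"define\s+contact\s*\{(.*?)\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no contact definition found")
    fields = {}
    for line in match.group(1).splitlines():
        # Nagios treats ';' as the start of an inline comment.
        line = line.split(";", 1)[0].strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            fields[parts[0]] = parts[1].strip()
    return fields
```

Calling parse_contact on the contents of the mj file above would give a dictionary whose "email" and "pager" entries are exactly the values the event handler needs to copy into the on-call contact.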

Next I wrote an event handler. That's right, Nagios will be doing the work itself. The event handler did a couple of things.

The primary job was to update the email and pager information for the on-call contact, but that wasn't all. Had Nagios done only that, the on-call target would in all likelihood have received a lot of duplicate email: once as the on-call contact and once under their own name as a member of the group. I consider that poor taste. On-call is tough enough without having to enjoy duplicate notifications!

Therefore, the event handler also built out a group of admins containing everyone in the group except the on-call user.

So the oncall contact would look something like this:

define contact{
        contact_name                    oncall-syseng
        alias                           Oncall Syseng
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r
        host_notification_options       d,r
        service_notification_commands   notify-by-email,notify-by-epager
        host_notification_commands      host-notify-by-email,host-notify-by-epager
        email                           mj@company.com
        pager                           123456789@serviceprovider.com
}

define contactgroup{
        contactgroup_name       syseng
        alias                   Systems Group
        members                 oncall-syseng,jim,bob
}

Notice how mj is not a member of the syseng group, and how the oncall-syseng contact has notify-by-epager and host-notify-by-epager, where the mj contact earlier did not!
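The building of these two definitions can be sketched roughly as follows. This is an illustration in Python rather than our actual event handler; the function name render_oncall, its parameters, and the shape of the people dictionary are all assumptions of mine. The templates simply mirror the example above.

```python
CONTACT_TEMPLATE = """define contact{{
        contact_name                    oncall-{group}
        alias                           Oncall {group_title}
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r
        host_notification_options       d,r
        service_notification_commands   notify-by-email,notify-by-epager
        host_notification_commands      host-notify-by-email,host-notify-by-epager
        email                           {email}
        pager                           {pager}
}}
"""

GROUP_TEMPLATE = """define contactgroup{{
        contactgroup_name       {group}
        alias                   {group_alias}
        members                 {members}
}}
"""


def render_oncall(group, group_alias, oncall, people):
    """Build the on-call contact plus a group that leaves out the on-call person.

    'people' is a hypothetical dict mapping contact_name to a dict with
    'email' and 'pager' keys, e.g. as read from the per-person contact files.
    """
    person = people[oncall]
    # Everyone except the on-call person, so nobody is notified twice.
    others = [name for name in people if name != oncall]
    members = ",".join(["oncall-" + group] + others)
    contact = CONTACT_TEMPLATE.format(
        group=group,
        group_title=group.capitalize(),
        email=person["email"],
        pager=person["pager"],
    )
    groupdef = GROUP_TEMPLATE.format(
        group=group, group_alias=group_alias, members=members
    )
    return contact + "\n" + groupdef
```

With mj, jim and bob in the people dictionary and mj on call, the members line comes out as oncall-syseng,jim,bob, matching the example above.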

The last thing needed was a check that looked at the on-call page in our ticketing system and compared the person listed there to the one configured in the email field of the Nagios on-call superhero.

What actually happens?

With all that in place only a check was needed to query the ticketing system, or any simple web page that lists the on-call enthusiasts.

The check looks up who is in the on-call contact file, and if that differs from what the web page says, it goes to warning. In our case that triggers an immediate notification to the current on-call staff, telling them that someone else is taking over.
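The comparison at the heart of such a check is small. Here is a hedged sketch in Python; check_oncall is a name I made up, and how you fetch the current on-call person from your ticketing system is left out on purpose, since that part is specific to each environment. The numeric codes are the usual Nagios plugin exit codes (0 OK, 1 WARNING, 3 UNKNOWN).

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_oncall(configured_email, ticketing_email):
    """Compare the email in the on-call contact file with the ticketing system.

    Returns a (status_code, message) pair in the style of a Nagios plugin.
    Both arguments are plain strings; fetching them is up to the caller.
    """
    configured = (configured_email or "").strip().lower()
    current = (ticketing_email or "").strip().lower()
    if not configured or not current:
        return UNKNOWN, "UNKNOWN - could not determine who is on call"
    if configured == current:
        return OK, "OK - on-call is %s" % configured
    return WARNING, "WARNING - on-call changing from %s to %s" % (
        configured,
        current,
    )
```

A real plugin would print the message and exit with the status code, for example with sys.exit. The WARNING state is what gives the event handler something to react to.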

Then, in short order, the event handler fires, updates the oncall contact in the manner described above, and reloads Nagios. That's right: the Nagios event handler calls /etc/init.d/nagios reload itself. This will likely require sudo if you installed Nagios in the suggested fashion. Of course, before the reload the event handler verifies the configuration files, and if anything isn't right, it restores the last oncall configuration, which was saved before the updates took place.
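The save, verify, reload and restore sequence could look something like this in Python. Again, this is a sketch under assumptions and not our production handler: apply_oncall_config is a hypothetical name, the ".bak" suffix is my choice, the on-call file is assumed to exist already, and the exact verify and reload commands are passed in because paths differ between installations.

```python
import shutil
import subprocess


def apply_oncall_config(oncall_file, new_text, verify_cmd, reload_cmd,
                        run=subprocess.call):
    """Write the new on-call config, verify it, and reload Nagios.

    verify_cmd and reload_cmd are argument lists, for example
    ["nagios", "-v", "nagios.cfg"] and
    ["sudo", "/etc/init.d/nagios", "reload"].
    'run' must return the command's exit status; it is a parameter
    so that the function can be tested without touching Nagios.
    """
    backup = oncall_file + ".bak"
    shutil.copy2(oncall_file, backup)  # keep the last known-good config
    with open(oncall_file, "w") as handle:
        handle.write(new_text)
    if run(verify_cmd) != 0:
        # Verification failed: put the old config back and skip the reload.
        shutil.copy2(backup, oncall_file)
        return False
    return run(reload_cmd) == 0
```

The point of the ordering is that Nagios is never reloaded with a configuration that failed verification, so a bad update costs you a stale on-call contact rather than a dead monitoring system.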

What else?

The code for all this is pretty specific to our environment, but the approach should be adaptable. In our case we put the contacts into a subdirectory and prefix all contact files with 'syseng-', which makes it possible to add other groups to the same scheme. In many environments this could make it easy to build out a 'network-', 'systems-' and 'dba-' organization, with different on-call staff in each group.
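As an illustration of how the prefix scheme lets one script serve several groups, here is a small Python sketch that sorts contact file names into groups. The function name group_contact_files and the ".cfg" extension are assumptions for the example; our own file names may well differ from yours.

```python
import os
from collections import defaultdict


def group_contact_files(filenames):
    """Map each group prefix to the contact names found under it.

    Assumes hypothetical file names of the form '<group>-<contact>.cfg',
    for example 'syseng-mj.cfg'. Anything else is ignored.
    """
    groups = defaultdict(list)
    for name in filenames:
        stem, ext = os.path.splitext(name)
        if ext != ".cfg" or "-" not in stem:
            continue
        prefix, contact = stem.split("-", 1)
        groups[prefix].append(contact)
    return dict(groups)
```

Feeding it the output of os.listdir on the contacts subdirectory gives one list of people per group, and each list can then be handled by the same on-call logic.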

Another option we didn't take very far was that the contact files themselves could likely be used to populate a web interface, but since we had the ticketing system, we didn't worry too much about that.

Hope all of this makes sense and is of value to someone.