May 12, 2005

Empty Nagios Notifications Caused by Lack of Disk Space

A few days ago I received several blank emails from my Nagios installation, with no content and no indication of what was wrong. A quick glance turned up nothing wrong with its configuration. After a few minutes of breaking the problem down, I discovered that the problem wasn't with Nagios at all.

The other day I was doing some prep work for a presentation when I received several emails from Nagios. I immediately noticed that they were blank. The From field was normal:

From: Nagios Monitor <nagios@localhost>

Unfortunately there were no values in the To or Subject fields, nor did the body contain any data. This particular instance of Nagios is rather well tuned, so I don't receive a lot of unnecessary email. As a result, the warnings that I do receive are usually important and must be dealt with as soon as possible. Empty emails certainly do not help in this regard. Here are the steps I took to troubleshoot the cause of the emails.

The first step was to log in to the Nagios GUI. After logging in I proceeded to the Notifications page to see if Nagios had indeed initiated the emails in question. Sure enough, there were a few notices whose timestamps matched the emails that I had received.

The next step was to check the notification methods to make sure that they were working. I had recently updated the OS on the system. Although not likely, it was possible that printf got hosed up during the upgrade, perhaps with some slightly different command-line syntax. One rogue character or two in the printf statement could easily create null output too, although the config file hadn't been changed in months. After checking the misccommands.cfg file, I ran the notify-by-email command (minus the pipe to mail) by hand from bash to verify that the emails would be formatted properly. The output looked fine, so this wasn't the source of the problem either. So I ran the entire command (including the pipe to mail) and discovered the culprit.
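A by-hand test of the printf half of the command might look something like the sketch below. The message template and the sample values are assumptions for illustration, not the contents of my actual misccommands.cfg. Nagios normally fills in macros such as $HOSTNAME$ before it runs the command, so a manual test has to supply stand-in values.

```shell
#!/bin/sh
# Hypothetical stand-ins for the values Nagios would substitute for its macros.
NOTIFICATION_TYPE="PROBLEM"
HOST_NAME="web01"
HOST_STATE="DOWN"

# Build the message body the way a notify-by-email style command would,
# but without the pipe to mail, so the output can be inspected directly.
render_notification() {
    printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATION_TYPE\nHost: $HOST_NAME\nState: $HOST_STATE\n"
}

render_notification
```

If the text prints correctly here but the message arrives empty once the output is piped to mail, the problem lies downstream of printf rather than in the command definition.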

I received an error message stating that the process couldn't write to /tmp. A quick df revealed that the partition was completely full. After I removed some stale files and logs from the partition, Nagios began to send out proper emails again.
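For reference, the kind of inspection described above can be sketched as follows. The seven-day cutoff and the top-ten limit are arbitrary assumptions, and nothing here deletes anything; it only lists candidates to review.

```shell
#!/bin/sh
# Directory to inspect; /tmp matches the case described above.
TARGET="/tmp"

# Print the usage percentage (without the % sign) of the filesystem
# holding the given path. -P forces the POSIX one-line-per-filesystem format.
usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

echo "$TARGET is $(usage_percent "$TARGET")% full"

# Ten largest entries directly under the directory, sizes in KB.
du -sk "$TARGET"/* 2>/dev/null | sort -rn | head -n 10

# Regular files on the same filesystem not modified in over 7 days.
# find exits non-zero when it hits unreadable directories; that is fine here.
find "$TARGET" -xdev -type f -mtime +7 2>/dev/null || true
```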

Naturally, disk space checks are run on all of the servers. So why wasn't I notified about low disk space on the Nagios box itself? As it turns out, it was an oversight. I had failed to include that particular server in the Linux hostgroup, on which all of the disk checks are run. A simple mistake, but an easy one to make once Nagios configs get large. While it shouldn't be a surprise to veteran *nix junkies, it helps to keep /tmp free of debris, since you never know which processes are going to attempt to use it.
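To illustrate what such a disk check boils down to, here is a minimal sketch that classifies a usage percentage using the standard Nagios plugin return codes (0 for OK, 1 for WARNING, 2 for CRITICAL). The function name and the 80/90 thresholds are hypothetical; on a real installation the check_disk plugin from the Nagios plugins package does this job properly.

```shell
#!/bin/sh
# Classify a disk usage percentage the way a Nagios plugin reports state:
# return 0 = OK, 1 = WARNING, 2 = CRITICAL.
# Arguments: used percent, warning threshold, critical threshold.
# The default thresholds are illustrative assumptions.
disk_state() {
    pct=$1
    warn=${2:-80}
    crit=${3:-90}
    if [ "$pct" -ge "$crit" ]; then
        echo "DISK CRITICAL - ${pct}% used"
        return 2
    elif [ "$pct" -ge "$warn" ]; then
        echo "DISK WARNING - ${pct}% used"
        return 1
    fi
    echo "DISK OK - ${pct}% used"
    return 0
}

# Sample values: a healthy partition and a nearly full one.
disk_state 42
disk_state 95 || echo "return code: $?"
```

In practice the percentage would come from df on the partition being watched, and the check only helps if the host is actually a member of the hostgroup the service is assigned to.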

Posted by alexm at May 12, 2005 07:16 PM.
Send comments/suggestions to contact@moundalexis.com.