Monitoring an Enterprise Hadoop Cluster using Ganglia and Nagios


Efficient monitoring of compute resources is more important than ever. In this article I will walk you through:

  • Installing and configuring the basic Ganglia setup.
  • How to use the Python modules to extend functionality with IPMI (the Intelligent Platform Management Interface).
  • How to use Ganglia host spoofing to monitor IPMI.

Our goal is to set up a monitoring system for a Linux cluster to support three different monitoring views:

  • An application user can view how full a queue is and which nodes are available for running jobs.
  • An admin user is alerted to system failures and sees a red error indicator in the Nagios Web interface. They also receive an email if nodes go down or temperatures climb too high.
  • A Systems Engineer can graph data, report on cluster utilization and make decisions on future hardware acquisitions.

Ganglia

Ganglia is an open source monitoring project, designed to scale to thousands of nodes, that started at UC Berkeley. Each machine runs a daemon called gmond which collects and sends the metrics (like processor speed, memory usage, etc.) it gleans from the operating system to a specified host. The host which receives all the metrics can display them and can pass on a condensed form of them up a hierarchy. This hierarchical schema is what allows Ganglia to scale so well. gmond has very little overhead which makes it a great piece of code to run on every machine in the cluster without impacting user performance.

There are times when all of this data collection can impact node performance. This is known as network “jitter”: lots of little messages arriving at the same time. We have found that keeping the nodes’ clocks in lockstep avoids this.

Installing Ganglia

There are many articles and resources on the Internet that will show you how to install Ganglia.

Prerequisites

Provided you have your yum repository set up, installing the prerequisites is easy:

yum -y install apr-devel apr-util check-devel cairo-devel pango-devel libxml2-devel \
  rpm-build glib2-devel dbus-devel freetype-devel fontconfig-devel gcc-c++ expat-devel \
  python-devel libXrender-devel

 

One prerequisite is not in the Red Hat repository:

wget http://ga13.files.bigpond.com:4040/fedora/linux/releases/9/Everything/source/SRPMS/libconfuse-2.6-1.fc9.src.rpm

rpmbuild --rebuild libconfuse-2.6-1.fc9.src.rpm
cd /usr/src/redhat/RPMS/x86_64/
rpm -ivh libconfuse-devel-2.6-1.x86_64.rpm libconfuse-2.6-1.x86_64.rpm

Remember, mirrors often change. If this doesn’t work, then use a search engine to find the libconfuse-2.6-1.fc9 source RPM.

RRDTool

RRDTool is the Round Robin Database Tool. It was created by Tobias Oetiker and provides the storage and graphing engine for many high-performance monitoring tools; Ganglia is one of them.

To install Ganglia, we first need to have RRDTool running on our monitoring server. RRDTool provides two functions that are leveraged by other programs:

  • It stores data in a Round Robin Database. As the data captured gets older, the resolution becomes less refined. This keeps the footprint small and still useful in most cases.
  • It can generate graphs from the data it has captured by using command-line arguments (a minimal example follows this list).
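
Ganglia drives RRDTool for you, so you never have to call it directly, but a quick hand-run example makes those two functions concrete. This is only a sketch: the file name demo.rrd, the data source named load, and the sample value are made up for illustration.

rrdtool create demo.rrd --step 300 DS:load:GAUGE:600:0:U RRA:AVERAGE:0.5:1:288  # keep one day of 5-minute samples
rrdtool update demo.rrd N:0.42  # store one reading at the current time
rrdtool graph demo.png --start -1d DEF:l=demo.rrd:load:AVERAGE LINE1:l#0000ff:"load"  # draw the last day as a line graph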

To install RRDTool, run the following (tested on versions 1.3.4 and 1.3.6):

cd /tmp/
wget http://oss.oetiker.ch/rrdtool/pub/rrdtool.tar.gz
tar zxvf rrdtool*
cd rrdtool-*
./configure --prefix=/usr
make -j8
make install
which rrdtool
ldconfig  # make sure you have the new rrdtool libraries linked.

Now that you have all prerequisites, let’s install Ganglia. Download the ganglia-3.1.1.tar.gz file and place it in the /tmp directory of your monitoring server. Then do the following:

cd /tmp/
tar zxvf ganglia*gz
cd ganglia-3.1.1/
./configure --with-gmetad
make -j8
make install

 

Configuring Ganglia

Now that the basic installation is done, there are several configuration items.

Do the following steps:

  1. Command line file manipulations.
  2. Modify /etc/ganglia/gmond.conf.
  3. Take care of multi-homed machines.
  4. Start it up on a management server.

Step 1: Command line file manipulations

As shown in the following:

cd /tmp/ganglia-3.1.1/   # you should already be in this directory
mkdir -p /var/www/html/ganglia/  # make sure you have apache installed
cp -a web/* /var/www/html/ganglia/   # this is the web interface
cp gmetad/gmetad.init /etc/rc.d/init.d/gmetad  # startup script
cp gmond/gmond.init /etc/rc.d/init.d/gmond
mkdir /etc/ganglia  # where config files go
gmond -t | tee /etc/ganglia/gmond.conf  # generate initial gmond config
cp gmetad/gmetad.conf /etc/ganglia/  # initial gmetad configuration
mkdir -p /var/lib/ganglia/rrds  # place where RRDTool graphs will be stored
chown nobody:nobody /var/lib/ganglia/rrds  # make sure RRDTool can write here.
chkconfig --add gmetad  # make sure gmetad starts up at boot time
chkconfig --add gmond # make sure gmond starts up at boot time

Step 2: Modify /etc/ganglia/gmond.conf

Now you can modify /etc/ganglia/gmond.conf to name your cluster. Suppose your cluster name is “fathom”; then you would change name = "unspecified" to name = "fathom".
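
If you would rather make that change from the command line, a sed one-liner does it; "fathom" is just the example name above, so substitute your own:

sed -i 's/name = "unspecified"/name = "fathom"/' /etc/ganglia/gmond.conf  # set the cluster name
grep -A 2 'cluster {' /etc/ganglia/gmond.conf  # verify the cluster block now shows name = "fathom"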

Step 3: Take care of multi-homed machines

In our sample cluster, eth0 carries the public IP address of the system. The monitoring server talks to the nodes in the cluster over the private network on eth1, so we need to tell Ganglia to multicast on eth1. This can be done by creating the file /etc/sysconfig/network-scripts/route-eth1 with the contents 239.2.11.71 dev eth1.

You can then restart the network with service network restart and confirm (for example with route -n) that 239.2.11.71 is routed through eth1. Note: Use 239.2.11.71 because that is the Ganglia default multicast channel. Change the route if you use a different channel or add more.
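
Put together, the whole step looks like this on a Red Hat-style system (using the default channel and the eth1 interface from above):

echo "239.2.11.71 dev eth1" > /etc/sysconfig/network-scripts/route-eth1  # static route for the Ganglia multicast channel
service network restart
route -n | grep 239.2.11.71  # the route should list eth1 as its interface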

Step 4: Start it up on a management server

Now you can start it all up on the monitoring server:

service gmond start
service gmetad start
service httpd restart

Pull up a Web browser and point it to the management server at http://localhost/ganglia. You’ll see that your management server is now being monitored. You’ll also see several metrics being monitored and graphed. One of the most useful is that you can monitor the load on this machine. Here is what mine looks like:

Figure 1. Monitoring load


Nothing much is happening here; the machine is just idling.

Get Ganglia on the nodes

Up to now, we’ve gotten Ganglia running on the management server; now we need to get the compute nodes reporting as well. It turns out that you can put Ganglia on the compute nodes by just copying a few files. This is something you can add to a Kickstart post-install script or to whatever update tooling you use.

The quick and dirty way to do it is like this: Create a file with all your host names. Suppose you have nodes deathstar001 through deathstar100. Then you would have a file called /tmp/mynodes that looks like this:

deathstar001
deathstar002
...skip a few...
deathstar099
deathstar100

Now just run this:

for i in `cat /tmp/mynodes`; do
scp /usr/sbin/gmond $i:/usr/sbin/gmond
ssh $i mkdir -p /etc/ganglia/
scp /etc/ganglia/gmond.conf $i:/etc/ganglia/
scp /etc/init.d/gmond $i:/etc/init.d/
scp /usr/lib64/libganglia-3.1.1.so.0 $i:/usr/lib64/
scp /lib64/libexpat.so.0 $i:/lib64/
scp /usr/lib64/libconfuse.so.0 $i:/usr/lib64/
scp /usr/lib64/libapr-1.so.0 $i:/usr/lib64/
scp -r /usr/lib64/ganglia $i:/usr/lib64/
ssh $i service gmond start
done

You can restart gmetad, refresh your Web browser, and you should see your nodes now showing up in the list.

Some possible issues you might encounter:

  • You may need to explicitly set the static route from step 3 above on the nodes as well.
  • You may have firewalls blocking the ports. gmond runs on port 8649. If gmond is running on a machine, you should be able to run the command telnet localhost 8649 and see a bunch of XML output scroll down your screen. (A sketch of the firewall rules follows this list.)
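
If a host firewall is blocking the port, opening it up looks roughly like this; the example assumes iptables on a Red Hat-style system, so adapt it to whatever firewall tooling you actually use:

iptables -A INPUT -p tcp --dport 8649 -j ACCEPT  # gmond's XML port
iptables -A INPUT -p udp --dport 8649 -j ACCEPT  # multicast metric traffic also uses 8649/udp
service iptables save  # persist the rules across reboots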

Observing Ganglia

Many system engineers have a hard time understanding their own workload or job behavior. They may be running custom code, or they may never have researched how their commercial applications behave. Ganglia can help profile applications.

We’ll use Ganglia to examine the attributes of running the Linpack benchmark. Figure 2 shows a time span where I launched three different Linpack jobs.

Figure 2. Watching over Linpack


As you can see from this graph, there is some network activity when the job launches. What is interesting, however, is that towards the end of the job, the network traffic increases quite a bit. If you knew nothing about Linpack, you could at least say this: Network traffic increases at the end of the job.

Figure 3 and Figure 4 show CPU and memory utilization respectively. From here you can see that we are pushing the limits of the processor and that our memory utilization is pretty high too.

Figure 3. CPU usage


Figure 4. Memory usage


These graphs give us great insight to the application we’re running: We’re using lots of CPU and memory and creating more network traffic towards the end of the running job. There are still a lot of other attributes about this job that we don’t know, but this gives us a great start.

Knowing these things can help make better purchasing decisions in the future when it comes to buying more hardware. Of course, no one buys hardware just to run Linpack … right?

Extending capability

The basic Ganglia install has given us a lot of cool information. Using Ganglia’s plug-ins gives us two ways to add more capability:

  • Through the addition of in-band plug-ins.
  • Through the addition of out-of-band spoofing from some other source.

The first method has been the common practice in Ganglia for a while. The second method is a more recent development and overlaps with Nagios in terms of functionality. Let’s explore the two methods briefly with a practical example.

In-band plug-ins

In-band plug-ins can be added in two ways.

  • Use a cron-job method and call Ganglia’s gmetric command to input data.
  • Use the new Python module plug-ins and script it.

The first method was the common way to do it in the past, and I’ll say more about it in the next section on out-of-band plug-ins. The problem is that it isn’t very clean. Ganglia 3.1.x added Python and C module plug-ins to make extending Ganglia feel more natural. Right now, I’m going to show you the second method.

First, enable Python plug-ins with Ganglia. Do the following:

  1. Edit the /etc/ganglia/gmond.conf file.

If you open it up, then you’ll notice about a quarter of the way down there is a section called modules that looks something like this:

modules {
    module {
           name = "core_metrics"
     }
     ...
}

We’re going to add another module to the modules section. The one you should stick in is this:

  module {
    name = "python_module"
    path = "modpython.so"
    params = "/usr/lib64/ganglia/python_modules/"
  }

On my gmond.conf I added the previous code stanza at line 90. This allows Ganglia to use Python modules. Also, a few lines below that, after the statement include ('/etc/ganglia/conf.d/*.conf'), add the line include ('/etc/ganglia/conf.d/*.pyconf'). This picks up the definitions of the metrics we are about to add.

  2. Make some directories.

Like so:

mkdir /etc/ganglia/conf.d
mkdir /usr/lib64/ganglia/python_modules

  3. Repeat steps 1 and 2 on all your nodes.

To do that,

  • Copy the new gmond.conf to each node to be monitored.
  • Create the two directories as in step 2 on each node to be monitored so that they too can use the Python extensions.

Now that the nodes are set up to run Python modules, let’s create a new one. In this example we’re going to add a plug-in that uses the Linux IPMI drivers. If you are not familiar with IPMI and you work with modern Intel and AMD machines, then please learn about it (see Resources).

We are going to use the open source IPMItool to communicate with the IPMI device on the local machine. There are several other choices like OpenIPMI or freeipmi. This is just an example, so if you prefer to use another one, go right on ahead.

Before starting work on Ganglia, make sure that IPMItool works on your machine. Run the command ipmitool -c sdr type temperature | sed 's/ /_/g'; if that command doesn’t work, try loading the IPMI device drivers and run it again:

modprobe ipmi_msghandler
modprobe ipmi_si
modprobe ipmi_devintf

After running the ipmitool command my output shows

Ambient_Temp,20,degrees_C,ok
CPU_1_Temp,20,degrees_C,ok
CPU_2_Temp,21,degrees_C,ok

So in my Ganglia plug-in, I’m just going to monitor ambient temperature. I’ve created a very poorly written plug-in called ambientTemp.py that uses IPMI based on a plug-in found on the Ganglia wiki that does this:

Listing 1. The poorly written Python plug-in ambientTemp.py
import os
def temp_handler(name):
  # our commands we're going to execute
  sdrfile = "/tmp/sdr.dump"
  ipmitool = "/usr/bin/ipmitool"
  # Before you run this Load the IPMI drivers:
  # modprobe ipmi_msghandler
  # modprobe ipmi_si
  # modprobe ipmi_devintf
  # you'll also need to change permissions of /dev/ipmi0 for nobody
  # chown nobody:nobody /dev/ipmi0
  # put the above in /etc/rc.d/rc.local

  if not os.path.exists(sdrfile):
    os.system(ipmitool + ' sdr dump ' + sdrfile)

  if os.path.exists(sdrfile):
    ipmicmd = ipmitool + " -S " + sdrfile + " -c sdr"
  else:
    print "file does not exist... oops!"
    ipmicmd = ipmitool + " -c sdr"
  cmd = ipmicmd + " type temperature | sed 's/ /_/g' "
  cmd = cmd + " | awk -F, '/Ambient/ {print $2}' "
  #print cmd
  entries = os.popen(cmd)
  for l in entries:
    line = l.split()
  # print line
  return int(line[0])

def metric_init(params):
    global descriptors

    temp = {'name': 'Ambient Temp',
        'call_back': temp_handler,
        'time_max': 90,
        'value_type': 'uint',
        'units': 'C',
        'slope': 'both',
        'format': '%u',
        'description': 'Ambient Temperature of host through IPMI',
        'groups': 'IPMI In Band'}

    descriptors = [temp]

    return descriptors

def metric_cleanup():
    '''Clean up the metric module.'''
    pass

#This code is for debugging and unit testing
if __name__ == '__main__':
    metric_init(None)
    for d in descriptors:
        v = d['call_back'](d['name'])
        print 'value for %s is %u' % (d['name'],  v)

Copy Listing 1 and place it into /usr/lib64/ganglia/python_modules/ambientTemp.py. Do this for all nodes in the cluster.

Now that we’ve added the script to all the nodes in the cluster, tell Ganglia how to execute it. Create a new file called /etc/ganglia/conf.d/ambientTemp.pyconf. The contents are as follows:

Listing 2. ambientTemp.pyconf
modules {
  module {
    name = "ambientTemp"
    language = "python"
  }
}

collection_group {
  collect_every = 10
  time_threshold = 50
  metric {
    name = "Ambient Temp"
    title = "Ambient Temperature"
    value_threshold = 70
  }
}

Save Listing 2 on all nodes.

The last thing that needs to be done before restarting gmond is to change the permissions of the IPMI device so that the user nobody (the account gmond runs as) can perform operations on it. Be aware that this makes your IPMI interface vulnerable to malicious local users!

This is only an example: chown nobody:nobody /dev/ipmi0.
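
To make that change persist across reboots, append it to rc.local as the comments in Listing 1 suggest (/dev/ipmi0 is the device name assumed there):

echo "chown nobody:nobody /dev/ipmi0" >> /etc/rc.d/rc.local  # reapply the permission change at boot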

Now restart gmond everywhere. If you get this running then you should be able to refresh your Web browser and see something like the following:

Figure 5. IPMI in-band metrics


The nice thing about in-band metrics is they allow you to run programs on the hosts and feed information up the chain through the same collecting mechanism other metrics use. The drawback to this approach, especially for IPMI, is that there is considerable configuration required on the hosts to make it work.

Notice that we had to make sure the script was written in Python, that the configuration file was in place, and that gmond.conf was set correctly. And we only did one metric! Just think of all you would need to do to write other metrics. Doing this on every host for every metric can get tiresome. IPMI is an out-of-band tool, so there’s got to be a better way, right? Yes, there is.

Out-of-band plug-ins (host spoofing)

Host spoofing is just the tool we need. Here we use the powerful gmetric, a command-line tool that inserts metrics into Ganglia, and tell it which host the data belongs to. In this way you can monitor anything you want.

The best part about gmetric? There are tons of scripts already written.
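
To get a feel for gmetric before scripting it, here is a one-off, hand-run call; the value and the spoofed IP:host pair (borrowed from the node examples later in this article) are purely illustrative, and the flags are the same ones the script below uses:

gmetric -n 'AmbientTemp' -v 21 -t int16 -u Celsius -S 172.10.11.1:x336001  # insert one spoofed temperature reading for host x336001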

As a learning experience, I’m going to reinvent the wheel and show you how to run ipmitool against remote machines and feed the results into Ganglia:

  1. Make sure ipmitool works on its own out of band.

I have set up the BMC (the baseboard management controller on the target machine) so that I can run IPMI commands against it. For example: My monitoring host’s name is redhouse. From redhouse I want to monitor all the other nodes in the cluster. Redhouse is where gmetad runs and where I point my Web browser to access all of the Ganglia information.

One of the nodes in my cluster has the host name x01. I set the BMC of x01 to have an IP address that resolves to the host x01-bmc. Here I try to access that host remotely:

# ipmitool -I lanplus -H x01-bmc -U USERID -P PASSW0RD sdr dump /tmp/x01.sdr
Dumping Sensor Data Repository to '/tmp/x01.sdr'
# ipmitool -I lanplus -H x01-bmc -U USERID -P PASSW0RD -S /tmp/x01.sdr sdr type Temperature
Ambient Temp     | 32h | ok  | 12.1 | 20 degrees C
CPU 1 Temp       | 98h | ok  |  3.1 | 20 degrees C
CPU 2 Temp       | 99h | ok  |  3.2 | 21 degrees C

That looks good. Now let’s put it in a script to feed to gmetric.

  2. Make a script that uses ipmitool to feed into gmetric.

We created the following script /usr/local/bin/ipmi-ganglia.pl and put it on the monitoring server:

#!/usr/bin/perl
# vallard@us.ibm.com
use strict;  # to keep things clean... er cleaner
use Socket;  # to resolve host names into IP addresses

# code to clean up after forks
use POSIX ":sys_wait_h";
# nodeFile: is just a plain text file with a list of nodes:
# e.g:
# node01
# node02
# ...
# nodexx
my $nodeFile = "/usr/local/bin/nodes";
# gmetric binary
my $gmetric = "/usr/bin/gmetric";
#ipmitool binary
my $ipmi = "/usr/bin/ipmitool";
# userid for BMCs
my $u = "xcat";
# password for BMCs
my $p = "f00bar";
# open the nodes file and iterate through each node
open(FH, "$nodeFile") or die "can't open $nodeFile";
while(my $node = <FH>){
  # fork so each remote data call is done in parallel
  if(my $pid = fork()){
    # parent process
    next;
  }
  # child process begins here
  chomp($node);  # get rid of new line
  # resolve node's IP address for spoofing
  my $ip;
  my $pip = gethostbyname($node);
  if(defined $pip){
    $ip = inet_ntoa($pip);
  }else{
    print "Can't get IP for $node!\n";
    exit 1;
  }
  # check if the SDR cache file exists.
  my $ipmiCmd;
  unless(-f "/tmp/$node.sdr"){
    # no SDR cache, so try to create it...
    $ipmiCmd = "$ipmi -I lan -H $node-bmc -U $u -P $p sdr dump /tmp/$node.sdr";
    `$ipmiCmd`;
  }
  if(-f "/tmp/$node.sdr"){
    # run the command against the cache so that it's faster
    $ipmiCmd = "$ipmi -I lan -H $node-bmc -U $u -P $p -S /tmp/$node.sdr sdr type Temperature";
    # put all the output into the @out array
    my @out = `$ipmiCmd`;
    # iterate through each @out entry.
    foreach(@out){
      # each output line looks like this:
      # Ambient Temp     | 32h | ok  | 12.1 | 25 degrees C
      # so we parse it out
      chomp(); # get rid of the new line
      # grab the first and 5th fields.  (Description and Temp)
      my ($descr, undef, undef, undef,$temp) = split(/\|/);
      # get rid of white space in description
      $descr =~ s/ //g;
      # grab just the temp (we assume Celsius anyway)
      $temp = (split(' ', $temp))[0];
      # make sure that temperature is a number:
      if($temp =~ /^\d+/ ){
        #print "$node: $descr $temp\n";
        my $gcmd = "$gmetric -n '$descr' -v $temp -t int16 -u Celsius -S $ip:$node";
        `$gcmd`;
      }
    }
  }
  # Child Thread done and exits.
  exit;
}
# wait for all forks to end...
while(waitpid(-1,WNOHANG) != -1){
  1;
}

Aside from all the parsing, this script just runs the ipmitool command and grabs the temperatures. It then puts those values into Ganglia with the gmetric command, one call per metric.

  3. Run the script as a cron job.

Run crontab -e. I added the following entry to run the script every 30 minutes: */30 * * * * /usr/local/bin/ipmi-ganglia.pl. You may want it to run more or less often.

  4. Open Ganglia and look at the results.

Opening up the Ganglia Web interface and looking at the graphs of one of the nodes, you can see that the hosts were spoofed and the metrics show up under each node’s entry:

Figure 6. The no_group metrics


One of the drawbacks to spoofing is that the metrics land in the no_group metrics group. gmetric doesn’t appear to offer a nice way to change the grouping as the in-band version does.

Installing Nagios

The effort to get Nagios rolling on your machine is well documented on the Internet. Since I tend to install it a lot in different environments, I wrote a script to do it.

First you need to download two packages:

  • Nagios (tested with version 3.0.6)
  • Nagios-plugins (tested with version 1.4.13)

The add-ons include:

  • The Nagios Event Log, which allows for monitoring Windows event logs
  • NRPE, the Nagios Remote Plugin Executor, which provides a lot of the same functionality as Ganglia

Get the tarballs and place them in a directory. For example, I have the following three files in /tmp:

  • nagios-3.0.6.tar.gz
  • nagios-plugins-1.4.13.tar.gz
  • naginstall.sh

Listing 3 shows the naginstall.sh install script:

Listing 3. The naginstall.sh script
#!/bin/ksh

NAGIOSSRC=nagios-3.0.6
NAGIOSPLUGINSRC=nagios-plugins-1.4.13
NAGIOSCONTACTSCFG=/usr/local/nagios/etc/objects/contacts.cfg
NAGIOSPASSWD=/usr/local/nagios/etc/htpasswd.users
PASSWD=cluster
OS=foo

function buildNagiosPlug {

  if [ -e $NAGIOSPLUGINSRC.tar.gz ]
  then
    echo "found $NAGIOSPLUGINSRC.tar.gz  building and installing Nagios"
  else
    echo "could not find $NAGIOSPLUGINSRC.tar.gz in current directory."
    echo "Please run $0 in the same directory as the source files."
    exit 1
  fi
  echo "Extracting Nagios Plugins..."
  tar zxf $NAGIOSPLUGINSRC.tar.gz
  cd $NAGIOSPLUGINSRC
  echo "Configuring Nagios Plugins..."
  if ./configure --with-nagios-user=nagios --with-nagios-group=nagios \
      --prefix=/usr/local/nagios > config.LOG.$$ 2>&1
  then
    echo "Making Nagios Plugins..."
    if make -j8 > make.LOG.$$ 2>&1
    then
      make install > make.LOG.$$ 2>&1
    else
      echo "Make failed of Nagios plugins.  See $NAGIOSPLUGINSRC/make.LOG.$$"
      exit 1
    fi
  else
    echo "configure of Nagios plugins failed.  See config.LOG.$$"
    exit 1
  fi
  echo "Successfully built and installed Nagios Plugins!"
  cd ..

}

function buildNagios {
  if [ -e $NAGIOSSRC.tar.gz ]
  then
    echo "found $NAGIOSSRC.tar.gz  building and installing Nagios"
  else
    echo "could not find $NAGIOSSRC.tar.gz in current directory."
    echo "Please run $0 in the same directory as the source files."
    exit 1
  fi
  echo "Extracting Nagios..."
  tar zxf $NAGIOSSRC.tar.gz
  cd $NAGIOSSRC
  echo "Configuring Nagios..."
  if ./configure --with-command-group=nagcmd > config.LOG.$$ 2>&1
  then
    echo "Making Nagios..."
    if make all -j8 > make.LOG.$$ 2>&1
    then
      make install > make.LOG.$$ 2>&1
      make install-init > make.LOG.$$ 2>&1
      make install-config > make.LOG.$$ 2>&1
      make install-commandmode > make.LOG.$$ 2>&1
      make install-webconf > make.LOG.$$ 2>&1
    else
      echo "make all failed.  See log:"
      echo "$NAGIOSSRC/make.LOG.$$"
      exit 1
    fi
  else
    echo "configure of Nagios failed.  Please read $NAGIOSSRC/config.LOG.$$ for details."
    exit 1
  fi
  echo "Done Making Nagios!"
  cd ..
}


function configNagios {
  echo "We'll now configure Nagios."
  LOOP=1
  while [[ $LOOP -eq 1 ]]
  do
    echo "You'll need to put in a user name.  This should be the person"
    echo "who will be receiving alerts.  This person should have an account"
    echo "on this server.  "
    print "Type in the userid of the person who will receive alerts (e.g. bob)> \c"
    read NAME
    print "What is ${NAME}'s email?> \c"
    read EMAIL
    echo
    echo
    echo "Nagios alerts will be sent to $NAME at $EMAIL"
    print "Is this correct? [y/N] \c"
    read YN
    if [[ "$YN" = "y" ]]
    then
      LOOP=0
    fi
  done
  if [ -r $NAGIOSCONTACTSCFG ]
  then
    perl -pi -e "s/nagiosadmin/$NAME/g" $NAGIOSCONTACTSCFG
    EMAIL=$(echo $EMAIL | sed s/\@/\\\\@/g)
    perl -pi -e "s/nagios\@localhost/$EMAIL/g" $NAGIOSCONTACTSCFG
  else
    echo "$NAGIOSCONTACTSCFG does not exist"
    exit 1
  fi

  echo "setting ${NAME}'s password to be 'cluster' in Nagios"
  echo "    you can change this later by running: "
  echo "    htpasswd $NAGIOSPASSWD $NAME"
  htpasswd -bc $NAGIOSPASSWD $NAME cluster
  if [ "$OS" = "rh" ]
  then
    service httpd restart
  fi

}


function preNagios {

  if [ "$OS" = "rh" ]
  then
    echo "making sure prereqs are installed"
    yum -y install httpd gcc glibc glibc-common gd gd-devel perl-TimeDate
    /usr/sbin/useradd -m nagios
    echo $PASSWD | passwd --stdin nagios
    /usr/sbin/groupadd nagcmd
    /usr/sbin/usermod -a -G nagcmd nagios
    /usr/sbin/usermod -a -G nagcmd apache
  fi

}
function postNagios {
  if [ "$OS" = "rh" ]
  then
    chkconfig --add nagios
    chkconfig nagios on
    # touch this file so that if it doesn't exist we won't get errors
    touch /var/www/html/index.html
    service nagios start
  fi
  echo "You may now be able to access Nagios at the URL below:"
  echo "http://localhost/nagios"

}



if [ -e /etc/redhat-release ]
then
  echo "installing monitoring on Red Hat system"
  OS=rh
fi

# make sure you're root:
ID=$(id -u)
if [ "$ID" != "0" ]
then
  echo "Must run this as root!"
  exit
fi

preNagios
buildNagios
buildNagiosPlug
configNagios
postNagios

Now run the script: ./naginstall.sh

This code works on Red Hat systems and should run if you’ve installed all the dependencies mentioned. While running naginstall.sh, you are prompted for the user that Nagios should send alerts to. You’ll be able to add others later. Most organizations have a mail alias that will send to people in a group.

If you have problems installing, take a look at the Nagios Web page (see Resources for a link) and join the mailing list.

Configuring Nagios

So let’s assume the script worked and everything installed perfectly. When the script exits successfully, you should be able to open your Web browser and see that your own local host is being monitored (as in Figure 7):

Figure 7. Screen showing your local host being monitored


By clicking Service Detail, you can see that we are monitoring several services (like PING, HTTP, load, users, and so on) on the local machine. This was configured by default.

Let’s examine the service called Root Partition. This service alerts you when the root partition gets full. You can get a full understanding of how this check is working by examining the configuration files that were generated upon installation.

The master configuration file

If you used the naginstall.sh script, then the master configuration file is /usr/local/nagios/etc/nagios.cfg. This file lists several cfg_file entries that pull in additional definitions. Among them is the line:

cfg_file=/usr/local/nagios/etc/objects/localhost.cfg

If you examine this file, you will see all of the services for the localhost that are present on the Web view. This is where the default services are being configured. The Root Partition definition appears on line 77.

The hierarchy of how the root partition check is configured is shown in Figure 8.

Figure 8. How the root partition check is configured


First notice the inheritance scheme of Nagios objects. The Root Partition definition uses the local-service definition, which in turn uses the generic-service definition. This defines how the service is called, how often it runs, and other tunable parameters.

The next important part of the definition is the check command it uses. It calls a command definition named check_local_disk with the parameters !20%!10%!/. This means that when free space on the partition drops to 20%, it will issue a warning; when it hits 10%, you’ll get a critical error. The / means that it is checking the “/” partition. check_local_disk in turn simply calls the check_disk plug-in, which is located in the /usr/local/nagios/libexec directory.
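
You can also run the underlying plug-in by hand to see exactly what Nagios sees; this is roughly what check_local_disk expands to for the Root Partition service, using the thresholds quoted above:

/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /  # warn at 20% free, critical at 10% free, on the / partition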

This is the basic idea of how configurations are set up. You can use this to create your own services to monitor and to tweak any of the parameters you want. For a more in-depth appreciation of what is going on, read the documentation and try setting some of the parameters yourself.

Sign up for alerts

Now that we’re all configured, sign up for alerts. We did this already in the beginning, but if you want to change or add users you can modify the /usr/local/nagios/etc/objects/contacts.cfg file. Just change the contact name to your name and the email to your email address. Most basic Linux servers should already be set up to handle mail.

Now let’s configure other nodes.

Configure for other nodes in the grid/cloud/cluster

I have a group of nodes in my Dallas data center. I’ll create a directory where I’ll put all of my configuration files:

mkdir -p /usr/local/nagios/etc/dallas

I need to tell Nagios that my configuration files are going to go in there. I do this by modifying the nagios.cfg file, adding this line:

cfg_dir=/usr/local/nagios/etc/dallas

I’m going to be creating a couple of files here, and they can get pretty confusing. Figure 9 illustrates the entities, the files they belong to, and the relationships between the objects.

Figure 9. Diagram of entities and their files


Keep this diagram in mind as you move through the rest of this setup and installation.

In the /usr/local/nagios/etc/dallas/nodes.cfg file, I define all the nodes and node groups. I have three types of machines to monitor:

  • Network servers (which in my case are Linux servers and have Ganglia running on them)
  • Network switches (my switches, including high-speed and Gigabit Ethernet)
  • Management devices (like blade management modules, old IBM RSA cards, BMCs, possibly smart PDUs, etc.)

I create three corresponding groups as follows:

define hostgroup {
 hostgroup_name dallas-cloud-servers
 alias Dallas Cloud Servers
}

define hostgroup {
 hostgroup_name dallas-cloud-network
 alias Dallas Cloud Network Infrastructure
}

define hostgroup {
 hostgroup_name dallas-cloud-management
 alias Dallas Cloud Management Devices
}

Next I create three template files with common characteristics for the nodes of these node groups to share:

define host {
        name dallas-management
        use linux-server
        hostgroups dallas-cloud-management
        # TEMPLATE!
        register 0
}


define host {
        name dallas-server
        use linux-server
        hostgroups dallas-cloud-servers
        # TEMPLATE!
        register 0
}

define host {
        name dallas-network
        use generic-switch
        hostgroups dallas-cloud-network
        # TEMPLATE!
        register 0
}

Now my individual node definitions are either dallas-management, dallas-server, or dallas-network. Here is an example of each:

define host {
 use dallas-server
 host_name x336001
 address 172.10.11.1
}
define host {
 use dallas-network
 host_name smc001
 address 172.10.0.254
}
define host {
 use dallas-management
 host_name x346002-rsa
 address 172.10.11.12
}

I generated a script to go through my list of nodes and completely populate that file with the nodes in my Dallas lab. When I restart Nagios, they’ll all be checked to see if they’re reachable. But I still have to add some other services!

You may want to restart Nagios first to make sure your settings took. If they did, then you should see some groups under the Host Group Overview view. If you have errors, then run:

/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

That will validate your file and help you find any errors.

You can now add some basic services. Following the templates from localhost, an easy one is to check for SSH on the dallas-cloud-servers group. Let’s start another file for that: /usr/local/nagios/etc/dallas/host-services.cfg. The easiest approach is to copy the service definitions you want to monitor out of localhost.cfg. I did that and added a dependency:

define service{
        use                             generic-service
        hostgroup_name                  dallas-cloud-servers
        service_description             SSH
        check_command                   check_ssh
        }

define service{
        use                             generic-service
        hostgroup_name                  dallas-cloud-servers
        service_description             PING
        check_command                   check_ping!100.0,20%!500.0,60%
        }

define servicedependency{
        hostgroup_name                  dallas-cloud-servers
        service_description             PING
        dependent_hostgroup_name        dallas-cloud-servers
        dependent_service_description   SSH
}

I didn’t want SSH tested if PING didn’t work. From this point you could add all sorts of things, but this gets us something to look at first. Reload Nagios and check the menus to make sure you see the PING and SSH checks for your nodes:

service nagios reload

All good? Okay, now let’s get to the interesting part and integrate Ganglia.

Integrate Nagios to report on Ganglia metrics

Nagios Exchange is another great place to get plug-ins for Nagios. But for our Ganglia plug-in to Nagios, look no further than the Ganglia tarball you downloaded earlier. Assuming you uncompressed the tarball in the /tmp directory, it is only a matter of copying the check_ganglia.py script that is in the contrib directory:

cp /tmp/ganglia-3.1.1/contrib/check_ganglia.py \
/usr/local/nagios/libexec/

check_ganglia is a cool Python script that you run on the same server where gmetad is running (and in my case, this is the management server where Nagios is running as well). Let’s have it query the localhost on port 8649. In this way, you don’t expend network traffic by running remote commands: You get the benefits of Ganglia’s scaling techniques to do this!

If you run telnet localhost 8649, you’ll see a ton of output from the data that has been collected on the nodes (provided you have Ganglia up and running as we set it up earlier). Let’s monitor a few things that Ganglia has for us.
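
Before wiring check_ganglia.py into Nagios, you can run it by hand to confirm it answers; the host and thresholds here are only examples, and the -h, -m, -w, and -c flags are the same ones the command definition below passes:

/usr/local/nagios/libexec/check_ganglia.py -h x336001 -m load_one -w 4 -c 5  # warn above a load of 4, go critical above 5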

Digging in the /var/lib/ganglia/rrds directory, you can see the metrics being measured on each host. Nice graphs are being generated, and you can analyze the metrics over time. We’re going to measure load_one and disk_free, and since we enabled IPMI temperature measurements earlier, let’s add that metric in as well.

Create the /usr/local/nagios/etc/dallas/ganglia-services.cfg file and add the services to it:

define servicegroup {
  servicegroup_name ganglia-metrics
  alias Ganglia Metrics
}

define command {
  command_name check_ganglia
  command_line $USER1$/check_ganglia.py -h $HOSTNAME$ -m $ARG1$ -w $ARG2$ -c $ARG3$
}

define service {
  use generic-service
  name ganglia-service
  hostgroup_name dallas-cloud-servers
  servicegroups ganglia-metrics
  notifications_enabled 0
  register 0
}


define service {
  use ganglia-service
  service_description load_one
  check_command check_ganglia!load_one!4!5
}


define service {
  use ganglia-service
  service_description ambient_temp
  check_command check_ganglia!AmbientTemp!20!30
}

define service {
  use ganglia-service
  service_description disk_free
  check_command check_ganglia!disk_free!10!5
}

When you restart Nagios, you now can do alerts on Ganglia metrics!

One caveat: The check_ganglia.py command only alerts when thresholds get too high. If you want it to alert when thresholds go too low (as in the case of disk_free), then you’ll need to hack the code. I changed the end of the file to look like so:

  if critical > warning:
    if value >= critical:
      print "CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value)
      sys.exit(2)
    elif value >= warning:
      print "CHECKGANGLIA WARNING: %s is %.2f" % (metric, value)
      sys.exit(1)
    else:
      print "CHECKGANGLIA OK: %s is %.2f" % (metric, value)
      sys.exit(0)
  else:
    if critical >= value:
      print "CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value)
      sys.exit(2)
    elif warning >= value:
      print "CHECKGANGLIA WARNING: %s is %.2f" % (metric, value)
      sys.exit(1)
    else:
      print "CHECKGANGLIA OK: %s is %.2f" % (metric, value)
      sys.exit(0)

Now restart Nagios:

service nagios restart

If all goes well, you should see Ganglia data being monitored by Nagios!

Figure 10. Ganglia data monitored by Nagios


With Ganglia and Nagios working together, you can go crazy and monitor just about anything now. You rule the cloud!

Extending Nagios: Monitor network switches

As clouds and virtualization become a part of life, the old boundary between the “network guys” and the “systems guys” becomes more blurred. A sysadmin who continues to ignore switch configuration and network topologies runs the risk of becoming obsolete.

So that your monitoring picture is never incomplete, I’ll show you how to extend Nagios to monitor a network switch. The advantage of using Nagios to monitor a network switch (instead of relying only on the switch vendor’s solution) is simple: you can monitor any vendor’s switch with Nagios. You’ve seen ping work; now let’s explore SNMP on the switches.

Some switches come with SNMP enabled by default; you can set it up following the vendor’s instructions. To set up SNMP on a Cisco switch, you can follow the example I give below for my switch, whose hostname is c2960g:

telnet c2960g
c2960g>enable
c2960g#configure terminal
c2960g(config)#snmp-server host 192.168.15.1 traps SNMPv1
c2960g(config)#snmp-server community public
c2960g(config)#exit
c2960g#copy running-config startup-config

Now, to see what you can monitor, run snmpwalk against the switch like this:

snmpwalk -v 1 -c public c2960g

If all goes well, you should see a ton of output passed back. You can then capture this output to a file and look through it for things worth monitoring.

I have another switch that I will use as an example here. When I run the snmpwalk command I see the ports and how they are labeled. I’m interested in getting the following information:

  • The MTU (IF-MIB::ifMtu.<portnumber>).
  • The speed the ports are running at (IF-MIB::ifSpeed.<port number>).
  • Whether or not the ports are up (IF-MIB::ifOperStatus.<port number>).
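
You can also query those OIDs individually with snmpget to see the raw values before handing them to Nagios; the switch (smc001) and port (5) are the ones used in the example below, and the community string is assumed to be public:

snmpget -v 1 -c public smc001 IF-MIB::ifMtu.5 IF-MIB::ifSpeed.5 IF-MIB::ifOperStatus.5  # MTU, speed, and up/down state of port 5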

To monitor this, I’ll create a new file, /usr/local/nagios/etc/dallas/switch-services.cfg. I have a map of my network hosts to switches, so I know where everything is plugged in; you should too. If you really want to run a cloud, all resources should have known states.

I’ll use node x336001 as an example here. I know it’s on port 5. Here is what my file looks like:

define servicegroup {
  servicegroup_name switch-snmp
  alias Switch SNMP Services
}

define service {
  use generic-service
  name switch-service
  host_name smc001
  servicegroups switch-snmp
  register 0
}

define service {
  use switch-service
  service_description Port5-MTU-x336001
  check_command check_snmp!-o IF-MIB::ifMtu.5
}
define service {
  use switch-service
  service_description Port5-Speed-x336001
  check_command check_snmp!-o IF-MIB::ifSpeed.5
}

define service {
  use switch-service
  service_description Port5-Status-x336001
  check_command check_snmp!-o IF-MIB::ifOperStatus.5
}

When finished, restart Nagios, and you should be able to view the switch entries:

Figure 11. Monitoring switches


This is just one example of how to monitor switches. Notice that I did not set up alerting or indicate what would constitute a critical condition. You may also note that there are other plug-ins in the libexec directory that can do similar things; check_ifoperstatus and others may do the trick as well. With Nagios there are many ways to accomplish a single task.

Extending Nagios: Job reporting to monitor TORQUE

There are lots of scripts you can write against TORQUE to determine how the queueing system is running. In this extension, assume you already have TORQUE up and running; TORQUE is a resource manager that works with schedulers like Moab and Maui. Let’s look at an open source Nagios plug-in, check_pbs.pl, written by Colin Morey.

Download this plug-in, put it into the /usr/local/nagios/libexec directory, and make sure it is executable. I had to modify the code a little to match where Nagios is installed, changing use lib "/usr/nagios/libexec"; to use lib "/usr/local/nagios/libexec";. I also had to point my $qstat = '/usr/bin/qstat' ; at wherever the qstat command lives. Mine looks like this: my $qstat = '/opt/torque/x86_64/bin/qstat' ;.
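
If you would rather script those two edits, sed one-liners like the following do the job; the paths are the ones from my setup, so substitute your own:

sed -i 's|use lib "/usr/nagios/libexec";|use lib "/usr/local/nagios/libexec";|' /usr/local/nagios/libexec/check_pbs.pl  # point at the real libexec directory
sed -i "s|'/usr/bin/qstat'|'/opt/torque/x86_64/bin/qstat'|" /usr/local/nagios/libexec/check_pbs.pl  # point at the real qstat binary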

Verify that it works (my queue is called dque):

[root@redhouse libexec]# ./check_pbs.pl -Q dque -tw 20 -tm 50
check_pbs.pl Critical: dque on localhost checked, Total number of jobs 
higher than 50.  Total jobs:518, Jobs Queued:518, Jobs Waiting:0, Jobs 
Halted:0 |exectime=9340us

You can use the -h option to show more things to monitor. Now let’s put it into our configuration file /usr/local/nagios/etc/dallas/torque.cfg:

define service {
        use                             generic-service
        host_name                       localhost
        service_description             TORQUE Queues
        check_command                   check_pbs!20!50
}

define command {
        command_name                    check_pbs
        command_line                    $USER1$/check_pbs.pl -Q dque -tw $ARG1$ -tm $ARG2$
}

After restarting Nagios, the service shows up under localhost:

Figure 12. TORQUE service appears after Nagios restart


In mine, I get a critical alert because I have 518 jobs queued!
