Monitoring Alarm Status on Juniper EX Switches

I am in the process of installing a number of Juniper EX2200, EX3200 and EX4200 switches for a client. As part of the setup I need to monitor the switches for any alarms (e.g. switch management interface down, or switch booted from the backup partition) and have them dealt with accordingly.

Looking through the SNMP OID tree for the EX switches, I came across the following useful table:

http://www.oidview.com/mibs/2636/JUNIPER-ALARM-MIB.html

Object Name                 Object Identifier
jnxAlarms                   1.3.6.1.4.1.2636.3.4
jnxCraftAlarms              1.3.6.1.4.1.2636.3.4.2
jnxAlarmRelayMode           1.3.6.1.4.1.2636.3.4.2.1
jnxYellowAlarms             1.3.6.1.4.1.2636.3.4.2.2
jnxYellowAlarmState         1.3.6.1.4.1.2636.3.4.2.2.1
jnxYellowAlarmCount         1.3.6.1.4.1.2636.3.4.2.2.2
jnxYellowAlarmLastChange    1.3.6.1.4.1.2636.3.4.2.2.3
jnxRedAlarms                1.3.6.1.4.1.2636.3.4.2.3
jnxRedAlarmState            1.3.6.1.4.1.2636.3.4.2.3.1
jnxRedAlarmCount            1.3.6.1.4.1.2636.3.4.2.3.2
jnxRedAlarmLastChange       1.3.6.1.4.1.2636.3.4.2.3.3

I have used the jnxRedAlarmCount and jnxYellowAlarmCount OID values as basic Opsview SNMP service checks to give me an initial overview. In the long term I will be looking to combine these into a full service check script that can check a number of different things.
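As a rough idea of what that combined script could look like, here is a minimal Python sketch. It is not the check I am running in Opsview. It assumes net-snmp's snmpget is installed, that the two counters are read as scalar instances (the OIDs from the table with .0 appended), and that any red alarm should be CRITICAL and any yellow alarm WARNING. The function names are my own.

```python
import subprocess

# Scalar instances of the two counters from the table (note the trailing .0).
RED_ALARM_COUNT_OID = "1.3.6.1.4.1.2636.3.4.2.3.2.0"
YELLOW_ALARM_COUNT_OID = "1.3.6.1.4.1.2636.3.4.2.2.2.0"

# Standard Nagios/Opsview plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def snmp_get_int(host, community, oid):
    """Read one integer value using net-snmp's snmpget (SNMP v2c)."""
    result = subprocess.run(
        ["snmpget", "-v", "2c", "-c", community, "-Oqv", host, oid],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


def evaluate_alarms(red_count, yellow_count):
    """Map the two alarm counts to a plugin exit code and status line.

    Assumed policy: red alarms are CRITICAL, yellow alarms are WARNING.
    """
    detail = f"red={red_count} yellow={yellow_count}"
    if red_count > 0:
        return CRITICAL, f"CRITICAL - {detail}"
    if yellow_count > 0:
        return WARNING, f"WARNING - {detail}"
    return OK, f"OK - {detail}"


def check_switch(host, community):
    """Fetch both counters from one switch and evaluate them."""
    red = snmp_get_int(host, community, RED_ALARM_COUNT_OID)
    yellow = snmp_get_int(host, community, YELLOW_ALARM_COUNT_OID)
    return evaluate_alarms(red, yellow)
```

A plugin wrapper would call check_switch(host, community), print the message and exit with the returned code.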

The setup of the Service Check in Opsview is fairly simple and below are screenshots of the config that I have for each service check.

All you need to configure on your hosts is the SNMP community string and you can apply these checks individually or via a Host Template.

Once I performed a reload I could see the following in Opsview for one of my switches:

A bit of inspection showed that the red alarm was for the management interface being down (it wasn't being used on this switch) and the yellow alarm was due to no rescue configuration being set. I cleared the alarms by issuing the following commands:

Now when I refresh the checks in Opsview I get an OK state for both checks.

Opsview – patch for check_route plugin

I was playing around with the check_route plugin and noticed a few issues that stopped it from running. To get it to work on my Opsview boxes I had to install a new package, change some settings on the traceroute program and then patch the script itself.

The first thing you need to do is install the traceroute package if it's not already installed:

Once installed you will find that the plugin will fail and show the following error:

Googling the first line, I found that the traceroute binary has to be setuid root:

Trying the plugin again, you get the following error:

To get around this, the plugin needs to ignore the first line of the traceroute output, which can be done with the following patch:
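The patch itself is at the link below. Purely to illustrate the idea of skipping the first line, here is a small Python sketch (the plugin is not written in Python, and this is not the patch). It assumes the first line is traceroute's "traceroute to ..." banner and that hop lines start with a hop number. The sample addresses are made up.

```python
def parse_hops(traceroute_output):
    """Return the address field of each hop, ignoring the first line."""
    lines = traceroute_output.strip().splitlines()
    hops = []
    for line in lines[1:]:  # skip the banner line
        fields = line.split()
        # Hop lines are assumed to look like: "<n>  <address>  <time> ms ..."
        if len(fields) >= 2 and fields[0].isdigit():
            hops.append(fields[1])
    return hops


# Hypothetical output of "traceroute -n", using documentation addresses.
SAMPLE = """traceroute to 192.0.2.10 (192.0.2.10), 30 hops max, 60 byte packets
 1  192.0.2.1  0.412 ms  0.398 ms  0.377 ms
 2  * * *
 3  192.0.2.10  1.201 ms  1.180 ms  1.175 ms
"""

print(parse_hops(SAMPLE))
```

Without the lines[1:] slice, the banner would be fed to the hop parser, which is roughly the kind of failure the patch avoids.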

http://snipt.net/mattywhi/opsview-check_route-diff/

Now the script runs as expected and you get the following output:

 

Opsview Labs

I had an unexpected, but much welcomed, tweet today from the team at Opsview who would like to make use of some of my writing about Opsview on their own Labs blog. I can honestly say that I wasn’t expecting the blog to be read and picked up in this way but I am pleased that I can hopefully reach a few more people with the data appearing on the Opsview Labs blog as well.

 

Monitoring HP ESXi Hosts using Insight Remote Support

This is just a direct link to the HP Blog article itself, but it is worth a read if you are looking at monitoring any HP server running ESX or ESXi. The main thing I have always found is that you need to have the HP extensions for ESXi installed, as this greatly improves what you can see from remote tools such as Insight Remote Support, Nagios/Opsview or the vSphere client itself.

The link to the article can be found here – http://h30507.www3.hp.com/t5/Technical-Support-Services-Blog/6-Simple-Steps-to-Monitoring-ESXi-with-Insight-Remote-Support/ba-p/100789

check_equallogic volumes bug

I have been playing around with the check_equallogic Nagios plugin written by Claudio Kuenzler (http://www.claudiokuenzler.com) to monitor some performance and utilisation values for a client, and I came across a bug in the latest release which I thought I would share.

The latest release allows you to monitor the size of a single volume as well as use a single check to monitor all volumes. I set up the check in Opsview as normal and then configured the Host Attributes on the SAN host for each volume on the SAN (there were 75 volumes to monitor). Having added all the checks and reloaded Opsview, I started to see a large number of OK checks for the volumes but also a number of UNKNOWN outputs from the plugin. Closer inspection showed that when two volumes have similar names (e.g. BES01-D and DR-BES01-D), the more generic name (BES01-D in this example) matches both volumes and the script returns an UNKNOWN value. The DR-BES01-D volume returned the correct stats as its name only matched one entry.

Looking through the code in the plugin the line that is causing the issue is:

When it greps the list of volumes from the SNMP walk it gets two values back, and the script cannot cope so it exits. After some playing around (and remembering the basics of writing bash scripts) I managed to work around the problem and changed the line to the following:

The change adds the quotation marks that surround the string value returned by the SNMP walk, so grep should only return exact matches. Having updated the script and re-run the checks, the UNKNOWN status was gone and the checks all returned the correct data.
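The plugin is a bash script and its actual lines are not reproduced here, but the effect of the quoting can be shown with a small Python sketch. The walk lines below are hypothetical (the OIDs are placeholders). The only assumption is that each volume name appears as a double-quoted STRING value.

```python
def match_volume_lines(walk_lines, volume, exact=True):
    """Return the walk lines that mention the volume.

    With exact=True the name is searched for including its surrounding
    double quotes, which is the essence of the fix described above.
    """
    needle = f'"{volume}"' if exact else volume
    return [line for line in walk_lines if needle in line]


# Hypothetical snmpwalk output; "OID.n" stands in for the real OIDs.
WALK = [
    'OID.1 = STRING: "BES01-D"',
    'OID.2 = STRING: "DR-BES01-D"',
]

print(len(match_volume_lines(WALK, "BES01-D", exact=False)))  # both lines match
print(len(match_volume_lines(WALK, "BES01-D", exact=True)))   # only the exact volume
```

The unquoted search matches both volumes because BES01-D is a substring of DR-BES01-D. Once the quotes are part of the search string, the leading quote no longer lines up with the longer name.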

Opsview: “ODW_STATUS WARNING – No update since” Workaround

For a while I have been seeing a daily ODW_STATUS WARNING about no updates since 03:59:59 on my master Opsview server. I was 90% sure this was due to the load that I put on the server (the load average sits around 6 and goes up to 13 at certain times of the day), but I still got bored of running cleanup_import and then import_runtime -i 1.

I started off by manually clearing out all but 1 week of data from the runtime database (this is run as part of opsview_master_housekeep for various tables), but this didn't resolve the issue. In the end I modified my cron table so that the rc.opsview cron_daily task runs 30 minutes later (at 41 minutes past the hour instead of 11 minutes past). Since changing that I have had no further recurrences of the "No update since" warning.

I am aware that each time I update Opsview I am going to have to make this change again, until I manage to move the databases to their own host and rebuild the master server onto new hardware, but it's a workaround for now!

For reference the crontab now looks like:

Nagios Windows Updates check

Following on from my post last night about the Windows Updates check on MonitoringExchange, a colleague reminded me that we actually modified the script from there, as we weren't looking for the names of the updates to be listed but simply for the total number of updates outstanding. The modified version of the script is listed below for reference, and the source for it is at the following URL: https://www.monitoringexchange.org/inventory/Check-Plugins/Operating-Systems/Windows-NRPE/Check-Windows-Updates

NSClient 0.3.9 released

NSClient 0.3.9 was released earlier this month and, from the looks of the change log (http://www.nsclient.org/nscp/blog/Blog-2011-07-05), should be a good replacement for 0.3.8. As with previous releases there are both 32-bit and 64-bit variants and the option of an MSI package or a ZIP download.

Some things I have noticed in the new release (these may have been in 0.3.8 but I never noticed them) are two new external scripts to check printer status and Windows Updates. I have been using my own Windows Update script (https://www.monitoringexchange.org/inventory/Check-Plugins/Operating-Systems/Windows-NRPE/Check-Windows-Updates) as I found that the ones that query WMI take longer than the default 10 second timeout to run. I gave the bundled script a go and it did a good job of outputting some useful information about the Windows Updates. However, it still took too long to run, so I doubt that I will be using it in its current form. The output when running it on my workstation is as follows:

The printer check also ran through my list of installed printers and came out with an "Unknown" status, and the details listed didn't match what Windows was saying. So again I probably won't be using this in its current form, and will more likely monitor the printers individually with SNMP based checks directly against the printers.

There are some good additions to the list of modules. CheckTaskSched looks to be a good addition to make sure that those scheduled tasks you have left to run on your server are running as expected and not left stuck in a running state (or didn’t exit with error code 0). CheckFile and CheckFile2 have been amalgamated into the CheckFiles module which will allow you to check a single file but also multiple files for certain criteria. The link above gives examples on checking file versions, line counts, file sizes etc.

For a full list of changes the change log can be found here: http://www.nsclient.org/nscp/blog/Blog-2011-07-05

Opsview: Host Attributes and Keywords

Having been an avid Nagios/Opsview user for a while, I am always keen to see new features that make my life of defining and managing systems easier. I had been meaning to try out the host attributes feature of Opsview for a while to redefine the way I monitor various "generic" features on my infrastructure. Up until now I have had to create an exception for each host that I want to monitor in a slightly different way, and remembering what did and didn't have exceptions was never the easiest thing to do.

This has all changed with the Host Attributes feature in Opsview. I can now define a single service check that takes a number of values. Opsview 3.7.2 currently only lets you define one argument, although looking at the SQL database there is capacity for 9. A forum post from Ton Voon has revealed a patch to the host-attributes tab that allows you to define 4 arguments, which should appear in an upcoming release (3.7.3 maybe). This means that I can define a host attribute (e.g. DISK), set the partition/disk name as its value and put the warning/critical values in separate arguments, which reduces the number of custom service checks or exceptions that I need to define.
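To make the idea concrete, here is a small Python sketch of how one service check definition could be expanded once per attribute value. It only illustrates the concept and is not Opsview's implementation. The %DISK% and %DISK:1% style placeholders are my assumption about the macro syntax, and the argument layout of the check is hypothetical.

```python
import re


def expand_attribute(template, name, value, args):
    """Substitute %NAME% with the value and %NAME:n% with the n-th argument."""
    pattern = re.compile(r"%" + re.escape(name) + r"(?::(\d+))?%")

    def replace(match):
        index = match.group(1)
        if index is None:
            return value
        position = int(index) - 1
        # An argument that was not set expands to an empty string.
        return args[position] if 0 <= position < len(args) else ""

    return pattern.sub(replace, template)


# One generic check definition (hypothetical arguments).
TEMPLATE = "-p %DISK% -w %DISK:1% -c %DISK:2%"

# Each host attribute instance: value plus its warning/critical arguments.
DISKS = [("/", ["15", "5"]), ("/var", ["20", "10"])]

for value, args in DISKS:
    print(expand_attribute(TEMPLATE, "DISK", value, args))
```

Each host then only carries its list of attribute values, and the check definition itself stays the same everywhere.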

I have managed to abstract my Disk space checks and also some checks for Exchange Information Store sizes across my organisation. I plan to try and further abstract other generalised items of monitoring (e.g. Windows Services, Performance counters etc).

Once I had created these checks I needed to add a viewport to display the status of my Information Stores. In the past this had to be set up manually on each host and service check. In the latest release it's possible to create a new keyword and then add the hosts/services that you want from the Keywords tab. This has made the process of creating new views/displays easier and the monitoring much simpler.

When I get some time I will put up some pictures to go with this article and expand on my ability to monitor network interfaces with the latest version of Opsview.

Opsview users – beware javascript-common package

I have just come across an interesting issue with my production Opsview server, where the web interface was loading successfully on http://<server>:3000 but http://<server> through the Apache proxy was not working following an upgrade from Ubuntu 8.04 to 10.04.

The upgrade process went smoothly and everything looked OK after reinstalling Opsview (the upgrade process uninstalls the package but the data is kept in the databases), except that I couldn't expand any of the menus across the top of Opsview and a small box had appeared under the search bar.

Opsview 3.7.0 with Javascript Error

A bit more research in my browser showed the following errors:

Opsview errors

Following on from this I popped a quick email to the opsview-users distribution group to find out if they were aware of any issues with the upgrade process. I reviewed the Apache proxy config and replaced it with the stock one from the new Opsview install. This didn't help.

Next I ran through disabling and re-enabling the three proxy modules and reloading Apache a few times, and still no joy.

Feedback from the mailing list suggested that the proxy config was correct, but to try accessing the JavaScript file http://<server>/javascript/prototype.js (this returned a 404 error) and to look at the Apache error logs at the same time.

The logs from Apache gave me the following:

I would expect the path for the JavaScript to be /usr/local/nagios/share/javascript/… and not just /usr/share/javascript. I double checked my Apache config and ran through all the configuration files that were included. I then excluded the /etc/apache2/conf.d directory and reloaded Apache. The result… Opsview loaded and displayed correctly:

Going back through the different files in the directory I came across javascript-common.conf which has the following code in it:

I removed the symlink, re-enabled the conf.d directory in the Apache config and all looked good.

Having a quick look round I couldn't find any reason for the package being installed on my machine, so I removed it and restarted Apache, followed by an apt-get check to see if there were any broken dependencies. There were none.

Upshot of all of this… Unless you want all your JavaScript to be in one location, don't install the javascript-common package.
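If you want to hunt for this sort of conflict without disabling directories one at a time, something like the Python sketch below could help. It assumes the clash comes from an Alias directive mapping /javascript somewhere else, which is what the path in the error log suggests. The sample config text is hypothetical, not a copy of the packaged file.

```python
import re

# Matches "Alias <url-path> <target>" and "AliasMatch ..." at the start of a line.
ALIAS_RE = re.compile(r"^\s*Alias(?:Match)?\s+(\S+)\s+(\S+)", re.IGNORECASE)


def find_aliases(conf_text, url_path):
    """Return (url_path, target) pairs for Alias directives on url_path."""
    wanted = url_path.rstrip("/")
    hits = []
    for line in conf_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # ignore commented-out directives
        match = ALIAS_RE.match(line)
        if match and match.group(1).rstrip("/") == wanted:
            hits.append((match.group(1), match.group(2)))
    return hits


# Hypothetical contents of a file under /etc/apache2/conf.d.
SAMPLE_CONF = """# Alias /javascript /old/location/
Alias /javascript /usr/share/javascript/
Alias /icons/ /usr/share/apache2/icons/
"""

print(find_aliases(SAMPLE_CONF, "/javascript"))
```

Running find_aliases over the text of each file in conf.d would show which one claims /javascript before the proxy rules get a look in.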