check_equallogic volumes bug

I have been playing around with the check_equallogic Nagios plugin written by Claudio Kuenzler (http://www.claudiokuenzler.com) to monitor some performance and utilisation values for a client, and I came across a bug in the latest release that I thought I would share.

The latest release lets you monitor the size of a single volume, as well as run a single check that covers all volumes. I set up the check in Opsview as normal and then configured the Host Attributes for the SAN host, one for each volume on the SAN (there were 75 volumes to monitor). After adding all the checks and reloading Opsview I started to see a large number of OK results for the volumes, but also a number of UNKNOWN outputs from the plugin. Closer inspection showed that when two volumes have similar names (e.g. BES01-D and DR-BES01-D), the shorter name, BES01-D in this example, matches both volumes and the script returns UNKNOWN. The DR-BES01-D check returned the correct stats because that name only matched one entry.

Looking through the code in the plugin the line that is causing the issue is:

volarray=$(snmpwalk -v 2c -c ${community} ${host} 1.3.6.1.4.1.12740.5.1.7.1.1.4 | grep -n ${volume} | cut -d : -f1)

When it greps the list of volumes from the SNMP walk it gets two line numbers back instead of one, and the script cannot cope so it exits. After some playing around (and remembering the basics of writing bash scripts) I managed to work around the problem by changing the line to the following:

volarray=$(eval snmpwalk -v 2c -c ${community} ${host} 1.3.6.1.4.1.12740.5.1.7.1.1.4 | grep -n "\"${volume}\"" | cut -d : -f1)

The change wraps the search pattern in the literal double quotation marks that snmpwalk prints around each string value, so grep only matches the exact volume name and no longer matches longer names that merely contain it. Having updated the script and re-run the checks, the UNKNOWN status was gone and the checks all returned the correct data.
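To see why the quotes matter, here is a minimal sketch that replays the problem against two invented lines of snmpwalk-style output. The sample text and the variable names loose and strict are assumptions for illustration only, and the quotes in the pattern are backslash-escaped so that grep receives them literally.

```shell
# Hypothetical sample of snmpwalk output for the volume-name OID
# (the OID suffixes and values are invented for illustration).
sample='SNMPv2-SMI::enterprises.12740.5.1.7.1.1.4.1 = STRING: "BES01-D"
SNMPv2-SMI::enterprises.12740.5.1.7.1.1.4.2 = STRING: "DR-BES01-D"'

volume="BES01-D"

# Unquoted pattern: matches any line that contains the volume name.
loose=$(printf '%s\n' "$sample" | grep -n "${volume}" | cut -d : -f1)

# Pattern wrapped in literal double quotes: matches the exact name only.
strict=$(printf '%s\n' "$sample" | grep -n "\"${volume}\"" | cut -d : -f1)

echo "loose match line numbers:  $(echo $loose)"
echo "strict match line numbers: $strict"
```

With the unquoted pattern both line numbers come back, which is the case the plugin cannot handle; with the quoted pattern only the line for BES01-D is returned.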

ESXi enabling SNMP

Last night I wrote an article about how to monitor the health of an ESXi server (link here) and I wanted to explain a bit more about my findings with SNMP on an ESXi host.

My goal with the monitoring was to use the check_dell and check_hp commands I have found for Nagios/Opsview to monitor the hardware that ESX is running on. The ESXi installs I am working with have the Dell and HP management agents installed, so I thought that everything would work out of the box and that enabling SNMP would let me query the different aspects of the hardware.

The official line from VMware was that SNMP is not enabled on ESXi and, with no console, can't be enabled. I knew, however, having read a recent post on the TechHead blog (link here), that you can see the snmp.xml file and that it shows SNMP as disabled, which made me think it must be possible to enable it. I was right.

A quick Google search came up with this article, and the process turned out to be fairly simple:

First you need to enter the “unsupported” console on your ESXi server. To do this, press Ctrl+Alt+F1 at your ESX console, type the word unsupported (N.B. you will not see the text on your screen) and press Enter. If all goes well you should see a password prompt. Enter your root password and you should get a warning that you are entering a mode that should only be used with VMware support, followed by a console prompt.

Type the following command to open the snmp.xml file in the vi text editor:

vi /etc/vmware/snmp.xml

You should see a single line of text at the top of the screen which is the contents of the xml file. Press i to enter Insert mode and change

<enabled>false</enabled>

to

<enabled>true</enabled>

Then scroll across and add the community name you want the SNMP agent to respond to, placing it between the following tags

<communities></communities>

so it should look like

<communities>public</communities>

I wasn't interested in setting up SNMP traps so I left the trap settings blank. To save, press Esc to exit insert mode and then type :wq to write the file and quit the editor.
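If you would rather not edit the file by hand, the same two changes can be sketched with sed. This is only a sketch under assumptions: the single-line file contents below are a stand-in I made up around the tags shown above (the real snmp.xml may contain more), and it works on a scratch copy rather than the live /etc/vmware/snmp.xml.

```shell
# Scratch copy standing in for /etc/vmware/snmp.xml (contents are assumed).
tmpfile=$(mktemp)
printf '%s\n' '<config><enabled>false</enabled><communities></communities></config>' > "$tmpfile"

# Community name the agent should respond to (example value from the steps above).
community="public"

# Flip the enabled flag and fill in the empty communities tag.
sed -e 's|<enabled>false</enabled>|<enabled>true</enabled>|' \
    -e "s|<communities></communities>|<communities>${community}</communities>|" \
    "$tmpfile" > "${tmpfile}.new"

result=$(cat "${tmpfile}.new")
printf '%s\n' "$result"

# Clean up the scratch files.
rm -f "$tmpfile" "${tmpfile}.new"
```

Check the printed result before copying anything over the real file, and keep a backup of the original.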

Finally we need to restart the services on the ESX host, which can be done with the following command

/sbin/services.sh restart

Great, SNMP is now enabled so I should be able to get the information I want from the HP/Dell management agents. Wrong. My snmpwalk of the host returned next to nothing useful about the hardware I was trying to monitor.

opsview@LON-SVR-MON1:~$ snmpwalk -v 2c -c public 10.9.0.65
SNMPv2-MIB::sysDescr.0 = STRING: VMware ESX 4.0.0 build-219382 VMware, Inc. x86_64
SNMPv2-MIB::sysObjectID.0 = OID: SNMPv2-SMI::enterprises.6876.4.1
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (6061646) 16:50:16.46
SNMPv2-MIB::sysContact.0 = STRING: not set
SNMPv2-MIB::sysName.0 = STRING: lon-svr-esx2.domain.local
SNMPv2-MIB::sysLocation.0 = STRING: not set
SNMPv2-MIB::sysServices.0 = INTEGER: 72
SNMPv2-MIB::sysORLastChange.0 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORID.1 = OID: SNMPv2-MIB::snmpMIB
SNMPv2-MIB::sysORID.2 = OID: IF-MIB::ifMIB
SNMPv2-MIB::sysORID.3 = OID: SNMPv2-SMI::enterprises.6876.1.10
SNMPv2-MIB::sysORID.4 = OID: SNMPv2-SMI::enterprises.6876.2.10
SNMPv2-MIB::sysORID.5 = OID: SNMPv2-SMI::enterprises.6876.3.10
SNMPv2-MIB::sysORDescr.1 = STRING: SNMPv2-MIB, RFC 3418
SNMPv2-MIB::sysORDescr.2 = STRING: IF-MIB, RFC 2863
SNMPv2-MIB::sysORDescr.3 = STRING: VMWARE-SYSTEM-MIB, REVISION 200801120000Z
SNMPv2-MIB::sysORDescr.4 = STRING: VMWARE-VMINFO-MIB, REVISION 200810230000Z
SNMPv2-MIB::sysORDescr.5 = STRING: VMWARE-RESOURCES-MIB, REVISION 200810150000Z
SNMPv2-MIB::sysORUpTime.1 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORUpTime.2 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORUpTime.3 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORUpTime.4 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORUpTime.5 = Timeticks: (0) 0:00:00.00
IF-MIB::ifNumber.0 = INTEGER: 4
IF-MIB::ifDescr.1 = STRING: Device vmnic0 at 02:00.0 bnx2
IF-MIB::ifDescr.2 = STRING: Device vmnic1 at 02:00.1 bnx2
IF-MIB::ifDescr.3 = STRING: Device vmnic2 at 03:00.0 bnx2
IF-MIB::ifDescr.4 = STRING: Device vmnic3 at 03:00.1 bnx2
IF-MIB::ifType.1 = INTEGER: ethernetCsmacd(6)
IF-MIB::ifType.2 = INTEGER: ethernetCsmacd(6)
IF-MIB::ifType.3 = INTEGER: ethernetCsmacd(6)
IF-MIB::ifType.4 = INTEGER: ethernetCsmacd(6)
IF-MIB::ifMtu.1 = INTEGER: 1500
IF-MIB::ifMtu.2 = INTEGER: 1500
IF-MIB::ifMtu.3 = INTEGER: 1500
IF-MIB::ifMtu.4 = INTEGER: 1500
IF-MIB::ifSpeed.1 = Gauge32: 1000000000
IF-MIB::ifSpeed.2 = Gauge32: 1000000000
IF-MIB::ifSpeed.3 = Gauge32: 0
IF-MIB::ifSpeed.4 = Gauge32: 0
IF-MIB::ifPhysAddress.1 = STRING: 18:a9:5:4e:a7:1c
IF-MIB::ifPhysAddress.2 = STRING: 18:a9:5:4e:a7:1e
IF-MIB::ifPhysAddress.3 = STRING: 18:a9:5:4e:a7:20
IF-MIB::ifPhysAddress.4 = STRING: 18:a9:5:4e:a7:22
IF-MIB::ifAdminStatus.1 = INTEGER: up(1)
IF-MIB::ifAdminStatus.2 = INTEGER: up(1)
IF-MIB::ifAdminStatus.3 = INTEGER: up(1)
IF-MIB::ifAdminStatus.4 = INTEGER: up(1)
IF-MIB::ifOperStatus.1 = INTEGER: up(1)
IF-MIB::ifOperStatus.2 = INTEGER: up(1)
IF-MIB::ifOperStatus.3 = INTEGER: down(2)
IF-MIB::ifOperStatus.4 = INTEGER: down(2)
IF-MIB::ifLastChange.1 = Timeticks: (0) 0:00:00.00
IF-MIB::ifLastChange.2 = Timeticks: (0) 0:00:00.00
IF-MIB::ifLastChange.3 = Timeticks: (0) 0:00:00.00
IF-MIB::ifLastChange.4 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::snmpInPkts.0 = Counter32: 187
SNMPv2-MIB::snmpInBadVersions.0 = Counter32: 0
SNMPv2-MIB::snmpInBadCommunityNames.0 = Counter32: 0
SNMPv2-MIB::snmpInBadCommunityUses.0 = Counter32: 0
SNMPv2-MIB::snmpInASNParseErrs.0 = Counter32: 0
SNMPv2-MIB::snmpEnableAuthenTraps.0 = INTEGER: disabled(2)
SNMPv2-MIB::snmpSilentDrops.0 = Counter32: 0
SNMPv2-MIB::snmpProxyDrops.0 = Counter32: 0

My conclusion is simple. SNMP is not enabled in ESXi because there is not much there to query; you can use the CIM queries that I mentioned in the previous post to look at hardware health instead.
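A quick way to confirm this is to pull out just the MIBs the agent advertises. The sketch below reuses the sysORDescr lines from my walk as sample input; the variable names are my own, and against a live host you would feed it the real snmpwalk output instead.

```shell
# sysORDescr lines copied from the snmpwalk output of the ESXi host.
walk_output='SNMPv2-MIB::sysORDescr.1 = STRING: SNMPv2-MIB, RFC 3418
SNMPv2-MIB::sysORDescr.2 = STRING: IF-MIB, RFC 2863
SNMPv2-MIB::sysORDescr.3 = STRING: VMWARE-SYSTEM-MIB, REVISION 200801120000Z
SNMPv2-MIB::sysORDescr.4 = STRING: VMWARE-VMINFO-MIB, REVISION 200810230000Z
SNMPv2-MIB::sysORDescr.5 = STRING: VMWARE-RESOURCES-MIB, REVISION 200810150000Z'

# Keep only the MIB name: drop everything up to "STRING: " and after the first comma.
mibs=$(printf '%s\n' "$walk_output" | grep 'sysORDescr' | sed -e 's/^.*STRING: //' -e 's/,.*$//')

printf '%s\n' "$mibs"
```

The list contains only the generic SNMPv2 and IF MIBs plus three VMware ones, with nothing from the Dell or HP agents, which is why the hardware checks had nothing to query.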

Monitoring ESXi Server health using Nagios/Opsview

As part of a project I am currently working on I have a requirement to check that my clients’ infrastructure is working to the best of its ability. Whilst we perform regular checks to ensure the sites are running as expected we don’t currently have an easy way to check the health of the ESX hosts that the virtual servers run on. Until now.

I had spent a lot of time trying to “hack” SNMP into being enabled on the ESXi boxes, which involved editing the snmp.xml file in the “unsupported” console on the host, but after enabling it I found that it didn't give me the data I needed to run my checks against. Looking a bit further I found a Python script which queries the CIM service on the ESX host to find out whether the hardware is working as expected. The script checks the ESX Health Status and reports the current status of the host back to your monitoring platform.

Installation is fairly straightforward. The following details are for an Opsview install running on Ubuntu 8.04 LTS server but should be easily adaptable to other installations if need be.

First log in to your server as normal and download the latest version of the pywbem module (http://archive.ubuntu.com/ubuntu/pool/universe/p/pywbem/pywbem_0.7.0.orig.tar.gz)

opsview@LON-SVR-MON1:~$ wget http://archive.ubuntu.com/ubuntu/pool/universe/p/pywbem/pywbem_0.7.0.orig.tar.gz

Once you have downloaded the module, extract it and run the Python installer as root

opsview@LON-SVR-MON1:~$ tar -xzf pywbem_0.7.0.orig.tar.gz
opsview@LON-SVR-MON1:~$ cd pywbem-0.7.0/
opsview@LON-SVR-MON1:~/pywbem-0.7.0$ sudo python setup.py install

Next you need to download the check_esx_wbem.py script (http://communities.vmware.com/docs/DOC-7170) and place it in your libexec folder

opsview@LON-SVR-MON1:~/pywbem-0.7.0$ cd /usr/local/nagios/libexec/
opsview@LON-SVR-MON1:/usr/local/nagios/libexec# wget http://communities.vmware.com/servlet/JiveServlet/downloadBody/7170-102-5-4233/check_esx_wbem.py
opsview@LON-SVR-MON1:/usr/local/nagios/libexec# sudo chown nagios:nagios check_esx_wbem.py
opsview@LON-SVR-MON1:/usr/local/nagios/libexec# sudo chmod a+x check_esx_wbem.py

You can test this from the command line using the following command

opsview@LON-SVR-MON1:/usr/local/nagios/libexec# ./check_esx_wbem.py https://10.9.0.65:5989 root Password

In my case I received the following output, but if everything is working as expected the script should return “OK”

WARNING : Power Supply 3 Power Supplies<br>CRITICAL : Power Supply 2 Power Supply 2: Failure detected<br>
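Nagios and Opsview decide the state of a check from the plugin's exit code rather than from its text: 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. Here is a minimal sketch for checking this by hand. The helper name status_name is hypothetical, and I am assuming check_esx_wbem.py follows the standard plugin convention.

```shell
# Translate a Nagios plugin exit code into its state name.
# Anything outside 0-2 is reported as UNKNOWN in this sketch.
status_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Example usage (commented out because it needs a reachable ESX host):
# ./check_esx_wbem.py https://10.9.0.65:5989 root Password
# status_name $?
```

Running the two commented lines straight after each other shows the state Opsview will record for the host.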

Now that we have confirmed the script is running we need to add it to Opsview. The first step is to reload Opsview to pick up the new plugin. Once complete, go to Configuration -> Service Checks and select Create New Service Check. Set up your check in a similar way to the image below (remember to substitute “root” and “Password” with a valid username and password to log in to your ESX host).

Save this service check and then apply it to your ESX hosts. If you have multiple ESX hosts with different usernames and passwords you don’t need to create multiple Service Checks, as the later versions of Opsview let you specify exceptions when you configure the check for a host.

Once you have configured this, reload Opsview and wait for it to start checking the ESX server(s). Below is the screenshot from my server with its disconnected PSU.

This should now allow you to keep an eye on your ESX hosts alongside the rest of your network monitoring system.