Opsview Labs

Posted by mattywhi | IT, Monitoring

I had an unexpected, but much welcomed, tweet today from the team at Opsview who would like to make use of some of my writing about Opsview on their own Labs blog. I can honestly say that I wasn’t expecting the blog to be read and picked up in this way but I am pleased that I can hopefully reach a few more people with the data appearing on the Opsview Labs blog as well.

 

This is just a direct link to the HP Blog article itself but worth a read if you are looking at monitoring any HP server running ESX or ESXi. The main bit that I have always found is that you need to install the HP extensions for ESXi installed as this greatly improves what you can see from remote tools such as Insight Remote Support, Nagios/Opsview or from the vSphere client itself.

The link to the article can be found here – http://h30507.www3.hp.com/t5/Technical-Support-Services-Blog/6-Simple-Steps-to-Monitoring-ESXi-with-Insight-Remote-Support/ba-p/100789

As part of my drive to backup all my switch/firewall configs I have been trying to get RANCID to backup the remaining devices on my network. The latest devices we added to the network were a pair of Juniper EX switches that are part of an iSCSI network and until now I have not had a backup of the configs. Looking at the documentation there is a set of commands to backup other JunOS devices so thought I would give it a go.

RANCID is running on an Ubuntu 10.04 server and is running version 2.3.3. and has the jlogin scripts in place. After adding the device information to the .cloginrc file I tested jlogin to check that it could connect as root to the device – it did. When I performed rancid_run however the device did not backup as expected and Rancid hung until it timed out. Upon closer inspection the issue came down to the fact that the root account will ssh to the BSD shell on the switch and not directly to the JunOS command line. To get around this I needed to setup a new user on the switches with the correct permissions and then get this to perform the backup of the switches. The command to add the config is as follows:

You will be prompted to choose a password and then confirm it before writing it to configuration

Now you can specify the details in RANCID:

The last thing that I did was to take a copy of jlogin and jrancid from an installation of RANCID 2.3.6 and everything seems to be working as expected.

I have had RANCID setup to backup switch and firewall config for a while now but not I had always had issues with backups of my Cisco access points which I had thought was an issue with the version of RANCID or the slight differences in IOS run on the WAPs versus the Switches. Turns out after revisiting it yesterday it was more a PEBKAC or ID-10-T error on my part!

What I had in my .cloginrc file was:

when I ran bin/clogin ip_address the device would login and get me to the enable prompt as expected but when run as part of rancid_run nothing was coming back for the config. After a bit of reading and searching the solution was simple enough and it wasnt a problem with RANCID or the Aironets….

should have been used instead of the noenable line.

I also managed to get RANCID to backup the config on my Juniper EX switches but that is a story for another post

I have been playing arond with the check_equallogic Nagios plugin written by Claudio Kuenzler (http://www.claudiokuenzler.com) to monitor some performance and utilisation values for a client and I came across a bug with the code in the latest release which I thought I would share.

The latest release allows you to monitor the size of a single volume as well as a single check to monitor all volumes. I setup the check in Opsview as normal and then proceeded to configure the Host Attributes for the SAN host for each volume on the SAN (there were 75 volumes to monitor). Having added all the checks and reloading Opsview I started to see a large number of OK checks for the volumes but also a number of UNKNOWN outputs from the plugin. Closer inspection showed that when you have two volumes that have the similar names (e.g. BES01-D and DR-BES01-D) the more generic name, BES01-D in this example will match for both volumes and the script will return an unknown value. The DR-BES01-D volume returned the correct stats as the volume name only matched one entry.

Looking through the code in the plugin the line that is causing the issue is:

When it grep’s the list of volumes from the SNMP walk it returns two values and the script cannot cope so exits. After some playing around (and remembering the basics of writing bash scripts) I managed to work around the problem and changed the line to the following:

The change adds the quotation marks that are surrounding the string value that is returned from the SNMPwalk so GREP should only return the exact matches. Having updated the script and re-run the checks the UNKNOWN status was gone and the checks all returned the correct data.

Following on from my article on the SQL files bug in RSA Authentication Manager 7.1 we were looking to carry out the upgrade to the client’s server in a maintenance window last weekend however the engineer carrying out the work was unable to login to the Operations Manager console to carry out certain parts of the upgrade task.

It turns out that since RSA was installed the Security Console Super Admin account had its password changed and in the updated documentation we lost the details of the password for the Operations Console as the two passwords are not linked. In order for us to get back into the Operations Console we had to run through the following:

  1. Create a new Super Admin from the Security Console in the Internal Database
  2. Run the RSA command line utility (C:Program FilesRSA SecurityRSA Authentication ManagerutilsRSAutil) to create a new Operations Console user account

Unfortunately it wasnt that easy to complete!

Initially when we ran RSAutil as one of the admin accounts we received an error stating that only one account could run it, the account that originally installed RSA! Luckily the account was still listed and we just needed to enable this and perform a swift “runas” to bring up a command prompt as that user.

Next we sent a good bit of time running through various commands to work out how we create a new Operations Console admin account. The final command that we needed to run was as follows:

rsautil manage-oc-administrators -a create -u UserCreatedEarlier -p PasswordForUserCreatedEarlier -g OperationsConsole-Administrators NewOperationsConsoleUsername NewOperationsConsolePassword

We were now able to login to the Operations Console using the account we created. Now to find another maintenance window to patch the RSA server

I’ve passed

Following a long break from completing my MCSA: Messaging in Server 2003 I have finally got round to updating this for the modern era and upgraded this first to MCTS in Windows Server 2008 and finally this afternoon completed my 70-647 exam to attain the qualification of Microsoft Certified IT Professional: Enterprise Administrator.

For those of you with an MCSA in Windows Server 2003 the upgrade is done with the following exams:

  • 70-648 – TS: Upgrading from Windows Server 2003 MCSA to, Windows Server 2008, Technology Specializations. This is equivalent to completing the following two exams 70-640 and 70-642
  • 70-680 – TS: Windows 7, Configuring. This is the client exam required as part of the qualification
  • 70-643 – TS: Windows Server 2008 Applications Infrastructure, Configuring
  • 70-647 – Pro: Windows Server 2008, Enterprise Administrator

Once Microsoft confirm it on the MCP site I will update the qualifications links on the left with the new logo.

For a while I have been seeing a daily ODW_STATUS_WARNING about no updates since 03:59:59 on my master opsview server. I was 90% sure this was due to the load that I put on the server (load average sits around 6 and goes up to 13 at certain times of the day) but still got bored of running cleanup_import and then import_runtime -i 1.

I started off by manually clearing out all but 1 week of data from the runtime database (this is run as part of opsview_master_housekeep for various tables) and this didnt resolve the issue. In the end I modified my cron table so that the rc.opsview cron_daily task runs 30 minutes later (at 41 minutes past the hour instead of 11 minutes past. Since changing that I seem to have had no further re-occurrences of the No update prompt.

I am aware that each time I update Opsview I am going to have to make this change until I manage to move the databases to their own host and rebuild the master server onto new hardware but its a workaround for now!

For reference the crontab now looks like:

Following on from my post last night about the Windows Updates check on MonitoringExchange a colleague reminded me that we acutally modified the script from there as we weren’t looking for the names of updates to be listed but simply to get the total number of updates that are outstanding. The modified version of the script is listed below for reference and the source for this is at the following URL: https://www.monitoringexchange.org/inventory/Check-Plugins/Operating-Systems/Windows-NRPE/Check-Windows-Updates

NSClient 0.3.9 was released earlier this month and from the looks of the change log should be a good replacement for 0.3.8. (http://www.nsclient.org/nscp/blog/Blog-2011-07-05). As with previous releases there are both 32-bit and 64-bit variants and the option for an MSI package or for a ZIP download.

Some things I have noticed in the new release (these may have been in 0.3.8 but I never noticed them) are two new external scripts to check Printer status and check Windows Updates. I have been using my own Windows Update script (https://www.monitoringexchange.org/inventory/Check-Plugins/Operating-Systems/Windows-NRPE/Check-Windows-Updates) as I found the ones that query WMI take longer than the default 10 seconds for the script to run without timing out. Giving the bundled script a go it did a good job of outputting some useful information about the Windows Updates however it still took too long to run so I doubt that I will be using this in its current form. The output when running it on my workstation is as follows:

The Printer check also ran through my list of installed printers and came out with an “Unknown” status and the details listed didnt match what Windows was saying so again probably wont be using this in its current format and more likely monitor the printers individually with SNMP based checks directly to the printers.

There are some good additions to the list of modules. CheckTaskSched looks to be a good addition to make sure that those scheduled tasks you have left to run on your server are running as expected and not left stuck in a running state (or didn’t exit with error code 0). CheckFile and CheckFile2 have been amalgamated into the CheckFiles module which will allow you to check a single file but also multiple files for certain criteria. The link above gives examples on checking file versions, line counts, file sizes etc.

For a full list of changes the change log can be found here: http://www.nsclient.org/nscp/blog/Blog-2011-07-05