Opsview – patch for check_route plugin

I was playing around with the check_route plugin and noticed a few issues that stopped it from running. To get it working on my Opsview boxes I had to install a new package, change the permissions on the traceroute binary and then patch the script itself.

The first thing you need to do is install the traceroute package if it's not already present:

sudo apt-get install traceroute

Once installed you will find that the plugin will fail and show the following error:

The specified type of tracerouting is allowed for superuser only
Can't use an undefined value as an ARRAY reference at ./check_route line 129.

Googling the first line, I found that the traceroute binary needs to be setuid root:

chmod u+s /usr/sbin/traceroute
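
You can check that the setuid bit has taken effect with a quick listing; the permissions should now start with -rwsr-xr-x:

ls -l /usr/sbin/traceroute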

Trying the plugin again gives the following error:

Use of uninitialized value $time_units in string eq at ./check_route line 114.
ROUTE UNKNOWN - Cannot cope with line 'traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets'

To get around this you need the plugin to ignore the first line of the traceroute output, which can be done with the following patch:

http://snipt.net/mattywhi/opsview-check_route-diff/
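
The patch itself is the diff linked above; all it does is throw away the traceroute header line before the hops are parsed. As a rough command-line illustration of the same idea (not part of the plugin itself):

traceroute -n 8.8.8.8 | tail -n +2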

Now the script runs as expected and you get the following output:

ROUTE OK - Time taken is 145.895 ms | total_time=145.895ms;5000;100000 hops=14;; route_change=0;;



Configuring Juniper SSG Firewalls to failover between Internet connections

I have been working with Netscreen, and then Juniper, firewall products for the past five years and am still learning new and interesting features they offer. One thing I have been configuring more and more recently is secondary Internet connections and fail-over between them for clients. This post runs through the steps required to configure an SSG firewall to use track-ip to monitor IP addresses on the Internet and then automatically fail over and fail back an Internet connection.

The first thing we need to do is move the interfaces that will carry the Internet connections so that each sits in its own virtual router. This allows us to have an active default route for each connection, with the two connections behaving independently of each other.

set interface ethernet0/0 zone null
set interface ethernet0/1 zone null
set zone untrust vrouter untrust-vr
set vrouter name adsl-vr
set zone name BackupUntrust
set zone BackupUntrust vrouter adsl-vr
set interface ethernet0/0 zone Untrust
set interface ethernet0/1 zone BackupUntrust

For this example I am using the 192.0.2.0/24 address range for my WAN connections – this was defined by the IETF in RFC 5735 as a subnet to be used for testing and documentation. As these interfaces are both public facing, I am also going to restrict management to secure protocols only:

set interface ethernet0/0 ip 192.0.2.2/29
set interface ethernet0/0 route
set interface ethernet0/0 manage-ip 192.0.2.3
set interface ethernet0/0 manage ping
set interface ethernet0/0 manage ssh
set interface ethernet0/0 manage ssl
set interface ethernet0/1 ip 192.0.2.10/29
set interface ethernet0/1 route
set interface ethernet0/1 manage-ip 192.0.2.11
set interface ethernet0/1 manage ping
set interface ethernet0/1 manage ssh
set interface ethernet0/1 manage ssl

Now we need to set up the default route out of each virtual router so that each connection can reach the rest of the Internet:

set vrouter untrust-vr
set route 0.0.0.0/0 interface ethernet0/0 gateway 192.0.2.1
exit
set vrouter adsl-vr
set route 0.0.0.0/0 interface ethernet0/1 gateway 192.0.2.9
exit

We need to ensure that our internal users are able to route via both the untrust-vr and the adsl-vr. This can be done by exporting the default static route from the untrust-vr and the adsl-vr into the trust-vr:

set vrouter "untrust-vr"
set access-list 1
set access-list 1 permit ip 0.0.0.0/0 1
set route-map name "untrust-vr_export" permit 1
set match ip 1
set preserve preference
exit
set export-to vrouter "trust-vr" route-map "untrust-vr_export" protocol static
set vrouter "adsl-vr"
set access-list 1
set access-list 1 permit ip 0.0.0.0/0 1
set route-map name "adsl-vr_export" permit 1
set match ip 1
exit
set export-to vrouter "trust-vr" route-map "adsl-vr_export" protocol static

This will import both default routes into the trust-vr, maintaining the preference of the untrust-vr export at 20 whilst the adsl-vr export is imported with a preference of 140, so the primary connection is preferred whenever its route is present.

Now that our users can connect to the Internet, we need to make sure that should there be an issue with the primary Internet circuit, the backup circuit can be used for Internet access. This is achieved by using track-ip to monitor a number of hosts on the Internet and, should they become unreachable, shut the interface down.

In this example we are using the IP addresses of some of the root DNS servers as the addresses the firewall will check for a valid Internet connection, but they could be any IP addresses that you expect to remain online and to respond to ping requests:

set interface ethernet0/0 threshold 75
set interface ethernet0/0 monitor track-ip ip
set interface ethernet0/0 monitor track-ip threshold 75
set interface ethernet0/0 monitor track-ip weight 75
set interface ethernet0/0 monitor track-ip ip 192.58.128.30 threshold 25
set interface ethernet0/0 monitor track-ip ip 192.58.128.30 weight 25
set interface ethernet0/0 monitor track-ip ip 192.36.148.17 threshold 25
set interface ethernet0/0 monitor track-ip ip 192.36.148.17 weight 25
set interface ethernet0/0 monitor track-ip ip 193.0.14.129 threshold 25
set interface ethernet0/0 monitor track-ip ip 193.0.14.129 weight 25

This will ping the three addresses every second and consider an address to have failed once the test has failed 25 times consecutively. With all three addresses failed, their summed weights reach the weight and threshold limits of 75 needed to shut down the interface.

UPDATE: Since this was written Juniper has released newer firmware that allows you to specify the interface failure threshold in addition to the track-ip threshold. This meant that track-ip would fail at 75 while the interface threshold defaulted to 255, so no failover occurred; the config above has been amended accordingly to reflect this change in behaviour.

If you want to test the status of the track-ip monitoring you can issue the following commands:

get interface ethernet0/0 monitor
get interface ethernet0/0 monitor track-ip

and you will be able to see the failure statistics as well as whether the interface has failed or not.

When the interface is shut down the default route in the untrust-vr is no longer valid and will be removed from the trust-vr, leaving the export from the adsl-vr active, and Internet traffic will continue to function as normal. In the background, the management address on the primary connection will continue to poll the configured IP addresses; when they become reachable again the weight and threshold fall back below the failure values, the interface comes back up and the untrust-vr route export reappears in the trust-vr.
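
If you want to watch the fail-over and fail-back happening from a client on the trust side, a quick loop along these lines will show the path changing as the interface goes down and comes back (just a sketch – 8.8.8.8 is only an example destination):

# run from a Linux client behind the firewall while testing the primary circuit
while true; do
    date
    traceroute -n -m 5 -w 2 8.8.8.8 | tail -n +2
    sleep 10
done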

The only other thing to consider here is inbound services on the backup line, such as MX records to permit mail delivery to a MIP or VIP on the secondary circuit.

If this is all configured correctly, the only thing users should notice is that any websites/services that use login sessions (e.g. online banking) will require logging in again after fail-over or fail-back, as their existing session will no longer be valid.

The only remaining task is to commit the changes you have made to flash:

save


Opsview Labs

I had an unexpected, but much welcomed, tweet today from the team at Opsview, who would like to make use of some of my writing about Opsview on their own Labs blog. I can honestly say that I wasn’t expecting the blog to be read and picked up in this way, but I am pleased that I can hopefully reach a few more people with the content appearing on the Opsview Labs blog as well.


Monitoring HP ESXi Hosts using Insight Remote Support

This is just a direct link to the HP blog article itself, but it is worth a read if you are looking at monitoring any HP server running ESX or ESXi. The main thing I have always found is that you need the HP extensions for ESXi installed, as this greatly improves what you can see from remote tools such as Insight Remote Support, Nagios/Opsview or the vSphere client itself.

The link to the article can be found here – http://h30507.www3.hp.com/t5/Technical-Support-Services-Blog/6-Simple-Steps-to-Monitoring-ESXi-with-Insight-Remote-Support/ba-p/100789

RANCID: Backing up Juniper EX switches

As part of my drive to back up all my switch/firewall configs I have been trying to get RANCID to back up the remaining devices on my network. The latest devices added to the network were a pair of Juniper EX switches that form part of an iSCSI network, and until now I have not had a backup of their configs. Looking at the documentation there is a set of commands to back up other JunOS devices, so I thought I would give it a go.

RANCID is running on an Ubuntu 10.04 server at version 2.3.3 and has the jlogin scripts in place. After adding the device information to the .cloginrc file I tested jlogin to check that it could connect to the device as root – it did. When I performed rancid-run, however, the device did not back up as expected and RANCID hung until it timed out. On closer inspection the issue came down to the fact that the root account is dropped into the BSD shell on the switch rather than straight into the JunOS command line. To get around this I needed to set up a new user on the switches with the correct permissions and have RANCID use that account to back up the switches. The command to add the user is as follows:

set system login user adminusername class super-user authentication plain-text-password

You will be prompted to choose a password and then confirm it, before writing the change to the configuration:

commit and-quit

Now you can specify the login details in RANCID's .cloginrc:

add user ip_address {username}
add password ip_address {password}
add method ip_address {ssh}
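
The switch also needs an entry in the relevant group's router.db file so that rancid-run picks it up; in RANCID 2.3.x the fields are colon separated and the hostname below is just a placeholder:

switch-hostname:juniper:up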

The last thing that I did was to take a copy of jlogin and jrancid from an installation of RANCID 2.3.6, and everything now seems to be working as expected.

RANCID: Issue backing up Cisco Aironet access points

I have had RANCID set up to back up switch and firewall config for a while now, but I had always had issues with backups of my Cisco access points, which I had put down to the version of RANCID or the slight differences between the IOS run on the WAPs and on the switches. It turns out, after revisiting it yesterday, that it was more of a PEBKAC or ID-10-T error on my part!

What I had in my .cloginrc file was:

add user ip_address {username}
add password ip_address {password}
add method ip_address {ssh}
add noenable ip_address 1

When I ran bin/clogin ip_address the device would log me in and get me to the enable prompt as expected, but when run as part of rancid-run nothing was coming back for the config. After a bit of reading and searching the solution was simple enough, and it wasn't a problem with RANCID or the Aironets…

add autoenable ip_address 1

should have been used instead of the noenable line.
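
For reference, the working .cloginrc stanza for the access points ends up looking like this (ip_address, username and password being placeholders as before):

add user ip_address {username}
add password ip_address {password}
add method ip_address {ssh}
add autoenable ip_address 1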

I also managed to get RANCID to back up the config on my Juniper EX switches, but that is a story for another post.

check_equallogic volumes bug

I have been playing around with the check_equallogic Nagios plugin written by Claudio Kuenzler (http://www.claudiokuenzler.com) to monitor some performance and utilisation values for a client, and I came across a bug in the latest release which I thought I would share.

The latest release allows you to monitor the size of a single volume as well as providing a single check to monitor all volumes. I set up the check in Opsview as normal and then proceeded to configure the Host Attributes on the SAN host for each volume on the SAN (there were 75 volumes to monitor). Having added all the checks and reloaded Opsview I started to see a large number of OK results for the volumes, but also a number of UNKNOWN outputs from the plugin. Closer inspection showed that when you have two volumes with similar names (e.g. BES01-D and DR-BES01-D) the more generic name, BES01-D in this example, will match both volumes and the script will return an unknown value. The DR-BES01-D volume returned the correct stats as that volume name only matched one entry.

Looking through the code in the plugin the line that is causing the issue is:

volarray=$(snmpwalk -v 2c -c ${community} ${host} 1.3.6.1.4.1.12740.5.1.7.1.1.4 | grep -n ${volume} | cut -d : -f1)

When it greps the list of volumes returned by the SNMP walk it gets two matches and the script cannot cope, so it exits. After some playing around (and remembering the basics of writing bash scripts) I managed to work around the problem and changed the line to the following:

volarray=$(eval snmpwalk -v 2c -c ${community} ${host} 1.3.6.1.4.1.12740.5.1.7.1.1.4 | grep -n "\"${volume}\"" | cut -d : -f1)

The change adds the quotation marks that surround the string value returned by the snmpwalk, so grep only returns exact matches. Having updated the script and re-run the checks, the UNKNOWN statuses were gone and the checks all returned the correct data.
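
You can see the difference the quoting makes with a quick test against some sample snmpwalk-style output (volume names taken from the example above):

printf 'STRING: "BES01-D"\nSTRING: "DR-BES01-D"\n' | grep -n 'BES01-D'
# matches both lines, which is what breaks the script
printf 'STRING: "BES01-D"\nSTRING: "DR-BES01-D"\n' | grep -n '"BES01-D"'
# matches only the exact volume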

RSA Authentication Manager 7.1 upgrade issue

Following on from my article on the SQL files bug in RSA Authentication Manager 7.1, we were looking to carry out the upgrade on the client’s server in a maintenance window last weekend; however, the engineer carrying out the work was unable to log in to the Operations Console to carry out certain parts of the upgrade task.

It turns out that since RSA was installed the Security Console Super Admin account has had its password changed, and in the updated documentation we lost the details of the password for the Operations Console, as the two passwords are not linked. In order to get back into the Operations Console we had to run through the following:

  1. Create a new Super Admin from the Security Console in the Internal Database
  2. Run the RSA command line utility (C:\Program Files\RSA Security\RSA Authentication Manager\utils\RSAutil) to create a new Operations Console user account

Unfortunately it wasn't that easy to complete!

Initially, when we ran RSAutil as one of the admin accounts, we received an error stating that only one account could run it: the account that originally installed RSA! Luckily that account was still listed and we just needed to enable it and perform a swift “runas” to bring up a command prompt as that user.
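
The runas step is nothing more exotic than something along these lines from a command prompt (the account name here is just a placeholder for whichever account performed the original install):

runas /user:DOMAIN\rsa_installer cmd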

Next we spent a good bit of time running through various commands to work out how to create a new Operations Console admin account. The final command that we needed to run was as follows:

rsautil manage-oc-administrators -a create -u UserCreatedEarlier -p PasswordForUserCreatedEarlier -g OperationsConsole-Administrators NewOperationsConsoleUsername NewOperationsConsolePassword

We were now able to log in to the Operations Console using the account we had created. Now to find another maintenance window to patch the RSA server.

Exam Success: 70-647 – Pro: Windows Server 2008, Enterprise Administrator

I’ve passed

Following a long break after completing my MCSA: Messaging in Server 2003, I have finally got round to updating it for the modern era, upgrading first to MCTS in Windows Server 2008 and then, this afternoon, completing my 70-647 exam to attain the qualification of Microsoft Certified IT Professional: Enterprise Administrator.

For those of you with an MCSA in Windows Server 2003, the upgrade is done with the following exams:

  • 70-648 – TS: Upgrading from Windows Server 2003 MCSA to Windows Server 2008, Technology Specializations. This is equivalent to completing the following two exams: 70-640 and 70-642
  • 70-680 – TS: Windows 7, Configuring. This is the client exam required as part of the qualification
  • 70-643 – TS: Windows Server 2008 Applications Infrastructure, Configuring
  • 70-647 – Pro: Windows Server 2008, Enterprise Administrator

Once Microsoft confirm it on the MCP site I will update the qualification links on the left with the new logo.

Opsview: “ODW_STATUS WARNING – No update since” Workaround

For a while I have been seeing a daily ODW_STATUS WARNING about no updates since 03:59:59 on my master Opsview server. I was 90% sure this was due to the load that I put on the server (the load average sits around 6 and goes up to 13 at certain times of the day) but I still got bored of running cleanup_import and then import_runtime -i 1.

I started off by manually clearing out all but one week of data from the runtime database (this is run as part of opsview_master_housekeep for various tables) and this didn't resolve the issue. In the end I modified my cron table so that the rc.opsview cron_daily task runs 30 minutes later (at 41 minutes past the hour instead of 11 minutes past). Since changing that I seem to have had no further recurrences of the "No update" warning.

I am aware that each time I upgrade Opsview I am going to have to re-apply this change, until I manage to move the databases to their own host and rebuild the master server on new hardware, but it's a workaround for now!
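
The change itself is just an edit to the Opsview user's crontab; on my installs the entries live under the nagios user, so:

sudo crontab -u nagios -e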

For reference the crontab now looks like:

# OPSVIEW-START
# Do not remove comment above. Everything between OPSVIEW-START and OPSVIEW-END
# will be automatically installed as part of an Opsview install/upgrade
0,5,10,15,20,25,30,35,40,45,50,55 * * * * /usr/local/nagios/bin/mrtg_genstats.sh > /dev/null 2>&1
41 3 * * * /usr/local/nagios/bin/rc.opsview cron_daily > /dev/null 2>&1
22 2,6,10,14,18,22 * * * . /usr/local/nagios/bin/profile && /usr/local/nagios/bin/opsview_cronjobs 4hourly > /dev/null 2>&1
0,5,10,15,20,25,30,35,40,45,50,55 * * * * /usr/local/nagios/bin/call_nmis nmis.pl type=collect mthread=true > /dev/null 2>&1
34 0,4,8,12,16,20 * * * /usr/local/nagios/bin/call_nmis nmis.pl type=update mthread=true > /dev/null 2>&1
4 * * * * . /usr/local/nagios/bin/profile && /usr/local/nagios/bin/import_runtime -q
# NMIS reports
0 0 * * * /usr/local/nagios/bin/call_nmis run-reports.sh day health
0 0 * * * /usr/local/nagios/bin/call_nmis run-reports.sh day top10
0 0 * * * /usr/local/nagios/bin/call_nmis run-reports.sh day outage
0 0 * * * /usr/local/nagios/bin/call_nmis run-reports.sh day response
0 0 * * * /usr/local/nagios/bin/call_nmis run-reports.sh day avail
0 0 * * * /usr/local/nagios/bin/call_nmis run-reports.sh day port
0 0 * * 0 /usr/local/nagios/bin/call_nmis run-reports.sh week health
0 0 * * 0 /usr/local/nagios/bin/call_nmis run-reports.sh week top10
0 0 * * 0 /usr/local/nagios/bin/call_nmis run-reports.sh week outage
0 0 * * 0 /usr/local/nagios/bin/call_nmis run-reports.sh week response
0 0 * * 0 /usr/local/nagios/bin/call_nmis run-reports.sh week avail
0 0 * * 0 /usr/local/nagios/bin/call_nmis run-reports.sh week port
0 0 1 * * /usr/local/nagios/bin/call_nmis run-reports.sh month health
0 0 1 * * /usr/local/nagios/bin/call_nmis run-reports.sh month top10
0 0 1 * * /usr/local/nagios/bin/call_nmis run-reports.sh month outage
0 0 1 * * /usr/local/nagios/bin/call_nmis run-reports.sh month response
0 0 1 * * /usr/local/nagios/bin/call_nmis run-reports.sh month avail
0 0 1 * * /usr/local/nagios/bin/call_nmis run-reports.sh month port
# OPSVIEW-END