Stuff you can do quickly with MRTG (that has nothing to do with router traffic)

Come on, everyone knows MRTG, right? That venerable tool for graphing router traffic, among other things. If you’ve worked as a sysadmin and/or network admin, you’ve probably been familiar with it for decades, although these days you probably use something more modern, such as Cacti or Zabbix.

The thing is, I still use MRTG a lot, old though it is and despite the many newer alternatives, because to me it has one huge advantage. Well, two, in fact, though they really amount to one thing: ease of use. MRTG is simple: it doesn’t need to run as a daemon; it’s just a binary you can run through cron. And MRTG is versatile: unlike other tools that really, really want to work with network traffic, or system memory, or processor usage, or something else (preferably to be read through SNMP, or a local agent), MRTG just wants one or two values, and it’ll call the first one “input” and the second “output”. And those values can come from anywhere.
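As for the “run through cron” part, something like the sketch below is all it takes. The binary and config file paths are assumptions (adjust them to your install), and the LANG=C is there because, as far as I know, MRTG complains when it’s started under a UTF-8 locale.

```shell
#!/bin/bash

# Hypothetical crontab entry: run MRTG every 5 minutes against one config file.
# Both paths are assumptions, not something MRTG mandates.
CRON_LINE='*/5 * * * * env LANG=C /usr/bin/mrtg /etc/mrtg/ptmail.cfg >/dev/null 2>&1'

# Show the entry; to install it, append it to the current crontab:
#   ( crontab -l 2>/dev/null; echo "$CRON_LINE" ) | crontab -
echo "$CRON_LINE"
```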

Just to show how easy it is: an MRTG configuration file usually looks something like this:

WorkDir:/var/www/htdocs/ptmail/

Refresh: 36000

Target[ptmail]: `/usr/local/bin/getptmail.sh`;
Options[ptmail]: growright, nopercent, integer, noinfo, gauge
MaxBytes[ptmail]: 12500000000000
kilo[ptmail]: 1000
YLegend[ptmail]: MB used / Emails
ShortLegend[ptmail]:  
LegendI[ptmail]:  MB used (uncompressed):
LegendO[ptmail]:  Emails:
Title[ptmail]: MB used / Emails
PageTop[ptmail]: <h1>MB used / Emails</h1>

XSize[ptmail]: 500
YSize[ptmail]: 250
XScale[ptmail]: 1.4
YScale[ptmail]: 1.4

Timezone[ptmail]: Europe/Lisbon

The key part is that “Target” line which, you’ll notice, is between backticks, meaning that it’ll actually execute a script. And what does it expect from that script? Simple: just four lines:

  • A number (which it’ll think of as “input”, and which will typically be graphed in green)
  • Another number (for “output”, to be graphed in blue)
  • The system’s uptime (on Linux, uptime | cut -d " " -f 4-7 | cut -d "," -f 1-2 typically provides it in the format MRTG wants, although it’s really just text)
  • The system’s name (just use uname -n)
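Put together, a Target script boils down to something like the sketch below. The mrtg_report name and the two sample values (logged-in users, entries in /tmp) are mine, purely for illustration; any two numbers will do.

```shell
#!/bin/bash

# Hypothetical helper: print the four lines MRTG expects from a Target script.
mrtg_report() {
    echo "$1"    # line 1: the "input" value
    echo "$2"    # line 2: the "output" value
    # line 3: uptime (free text; falls back to "unknown" if uptime is unavailable)
    { uptime 2>/dev/null || echo "unknown"; } | cut -d " " -f 4-7 | cut -d "," -f 1-2
    uname -n     # line 4: system name
}

# Sample values only: logged-in users and number of entries in /tmp
mrtg_report "$(who | wc -l)" "$(ls /tmp | wc -l)"
```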

That’s it! If you only want to graph a single value, just add “noi” or “noo” to the Options line (and you can then replace the first or second line, respectively, with “echo 0” in your script). You can also make it not show the uptime and name with the “noinfo” option. And, finally, the “gauge” option makes MRTG graph the values exactly as the script reports them, while, without it, MRTG treats them as ever-increasing counters (like, for instance, a “bytes transferred” counter) and graphs the rate of change between the previous reading and the current one. Both have their uses.
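To make the gauge/counter distinction concrete, here is the arithmetic as I understand it, written as a tiny shell function. It only illustrates the idea and is not MRTG’s actual code; the counter_rate name and the 300-second interval (one run every 5 minutes) are assumptions.

```shell
#!/bin/bash

# Illustration only: what a counter-style reading turns into when graphed.
# counter_rate <previous reading> <current reading> <seconds between them>
counter_rate() {
    echo $(( ($2 - $1) / $3 ))
}

# A "bytes transferred" counter went from 1000 to 4000 in 300 seconds:
counter_rate 1000 4000 300    # prints 10 (bytes per second)

# With "gauge", the reading itself (say, 42% disk usage) is graphed unchanged.
```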

Now, you’re probably thinking: “yeah, yeah, I’ve known how to use MRTG for decades (and these days I use XYZ instead; who even uses MRTG in 2017?!?); why are you writing about something so basic and, well, old?” The answer is, again, 1) that it’s ridiculously easy to use (just create a script to write 4 lines), and 2) that it’s not for routers (or network interfaces) only; it can be used for so much else, and it can typically be done in minutes. So I’ll just show a few examples of stuff I already do with it on my servers. Each example will be just the shell script; the MRTG configuration file for these is virtually always the same thing, except for axis legends, labels, etc.

Note: all of the following need the “gauge” option.

“Steal” time on a VPS

#!/bin/bash

# Graphs "steal" time on a VPS. A high value means you need to complain to your
# VPS provider that the host your server is on is overbooked, or there's
# another customer abusing it (mining Bitcoins? 🙂 )

NUM=$((3 + ($RANDOM % 3)))

rm -f /tmp/steals.txt

top -b -n $NUM | grep Cpu | cut -d ',' -f 8 | tr -d ' ' | cut -d 's' -f 1 > /tmp/steals.txt

TOTAL=0 ; for i in `cat /tmp/steals.txt`; do TOTAL=`echo "$TOTAL + $i" | bc -l`; done

AVERAGE=`echo "$TOTAL / $NUM" | bc -l`

echo 0
echo $AVERAGE
uptime | cut -d " " -f 4-7 | cut -d "," -f 1-2
uname -n
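One caveat: the column that holds “steal” in top’s output varies between versions, so the cut above may need adjusting. A sketch of an alternative is to read the counters straight from /proc/stat, where steal is the eighth number on the “cpu” line. The steal_pct name and the 1-second sampling window are my own choices, and this version prints a whole-number percentage rather than a decimal one.

```shell
#!/bin/bash

# steal_pct "<earlier cpu line>" "<later cpu line>"
# Prints the integer percentage of CPU time stolen between two /proc/stat samples.
steal_pct() {
    local -a a=($1) b=($2)    # index 0 is the "cpu" label, index 8 is steal
    local total=0 i
    # user, nice, system, idle, iowait, irq, softirq, steal
    for i in 1 2 3 4 5 6 7 8; do
        total=$(( total + ${b[i]:-0} - ${a[i]:-0} ))
    done
    if [ "$total" -gt 0 ]; then
        echo $(( 100 * (${b[8]:-0} - ${a[8]:-0}) / total ))
    else
        echo 0
    fi
}

first=$(grep '^cpu ' /proc/stat 2>/dev/null)
sleep 1
second=$(grep '^cpu ' /proc/stat 2>/dev/null)

echo 0
steal_pct "$first" "$second"
uptime | cut -d " " -f 4-7 | cut -d "," -f 1-2
uname -n
```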

Total size and number of emails in a Maildir

#!/bin/bash

# Total size and number of mails in a Maildir

# total UNCOMPRESSED size of your mailbox, in MB, 2 decimal places
# actual compressed size will probably be much lower; if you'd rather show it,
# just replace the following with a "du -ms /home/ptmail/Maildir | cut -f 1"
find /home/ptmail/Maildir -type f | grep drax | cut -d '=' -f 2 | cut -d ',' -f 1 > /tmp/sizes-ptmail ; perl -e "printf \"%.2f\n\", `paste -s -d+ /tmp/sizes-ptmail | bc` / 1024 / 1024"

# number of emails
find /home/ptmail/Maildir -type f | grep drax | wc -l

uptime | cut -d " " -f 4-7 | cut -d "," -f 1-2
uname -n
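The size line works because each message’s file name carries its uncompressed size as an S=<bytes> field, which is what the two cuts pick out. Here is a sketch of the same sum without the temporary file or the perl/bc round trip; sum_maildir_mb is a made-up name, and it assumes every file you care about has that S= field.

```shell
#!/bin/bash

# Hypothetical helper: read Maildir file names on stdin, add up their S=<bytes>
# fields, and print the total in MB with 2 decimal places.
sum_maildir_mb() {
    sed -n 's/.*,S=\([0-9][0-9]*\).*/\1/p' \
        | awk '{ total += $1 } END { printf "%.2f\n", total / 1024 / 1024 }'
}

MAILDIR=/home/ptmail/Maildir
if [ -d "$MAILDIR" ]; then
    find "$MAILDIR" -type f | sum_maildir_mb
fi
```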

System load (* 100) and number of processes

#!/usr/bin/php
<?php

# Graph system load (multiplied by 100) and number of processes.

# Yes, this is trivial and uses mostly shell commands. 🙂 It's just to show
# that "scripts" invoked by MRTG can be in any language (in this case, PHP),
# not just shell scripts. You could even use compiled C code, for instance.

$x = exec ("uptime | cut -d ':' -s -f 5 | cut -d ',' -f 1 | cut -d ' ' -f 2");

$x*=100;

print $x . "\n";

system ("ps ax | wc -l");

system ("uptime | cut -d ' ' -f 4-7 | cut -d ',' -f 1-2");
system ("uname -n");

?>

Average time per request of a URL

#!/bin/bash

# Average time per request of the URL below. Run it locally to measure how fast
# your site and/or server is, or remotely to measure its network connection as well
# (though that would be less reliable; best use something like Pingdom)

# use "noi" for this one (hence the first "echo 0")

echo 0
ab -c 4 -n 100 -k https://zurgl.com/ | grep "Time per request" | head -n 1 | cut -d ':' -f 2 | tr -d ' ' | cut -d '[' -f 1

uptime | cut -d " " -f 4-7 | cut -d "," -f 1-2
uname -n

I could go on, but I’m sure you get the idea. 🙂

Linux: Detecting potential disk space problems before going home

As is typical in sysadmin work, one member of my team is always on call, having to be available during the night and on weekends. Sometimes the problems are big (especially hardware failures), sometimes they’re trivial, and sometimes they’re false alarms, but one thing is always true: it’s not fun to be woken up by a ringing phone, especially when you’re already not getting enough sleep.

One of the most common reasons for being called at night (or during weekends) is a disk’s usage exceeding a threshold (let’s say 95%, although that value varies in reality). Logs get larger (especially when some error is constantly occurring), old logs aren’t compressed/rotated/deleted automatically, users leave huge debug files somewhere and forget to delete them for years… All of these happen, and it’s mostly inevitable.

But, if we’re woken up at 4 A.M. because a /var partition reached 95% utilization, isn’t it true that in most cases the increase was gradual, and the value was already abnormally high — let’s say 93 or 94% — at 4 P.M.?

And wouldn’t it have been much better (for our well-earned rest, among other things) for the person on call to have fixed the problem during the day, before going home?

I thought so, too. Which is why I wrote a script some months ago to detect which servers (out of more than a thousand) will need attention soon.

(Looking at that script now, I’m a bit ashamed to share it, as it’s pretty lazily coded, repeating code one time for each filesystem instead of having some kind of function… but it was made in a bit of a hurry some time ago, so because of laziness… I mean, historical accuracy 🙂 I’ll share it as it is (other than translating variable names and output messages from my native Portuguese)).

So, here it is. Don’t laugh too hard, OK? 🙂

#!/bin/bash

# check / partition
ROOT=`df -hl / | tail -1 | awk '{ print $5 }' | cut -d '%' -f 1`
case $ROOT in
    ''|*[!0-9]*) ROOT=`df -hl / | tail -1 | awk '{ print $4 }' | cut -d '%' -f 1` ;;
esac
case $ROOT in
    ''|*[!0-9]*) ROOT=`df -hl / | tail -1 | awk '{ print $3 }' | cut -d '%' -f 1` ;;
esac

TOP=$ROOT
WORST="/"

# check /var partition
VAR=`df -hl /var | tail -1 | awk '{ print $5 }' | cut -d '%' -f 1`
case $VAR in
    ''|*[!0-9]*) VAR=`df -hl /var | tail -1 | awk '{ print $4 }' | cut -d '%' -f 1` ;;
esac

if [ "$VAR" -gt "$TOP" ]; then
        TOP=$VAR
        WORST="/var"
fi

# check /home partition
# comment out this entire block if you don't want to monitor /home
# (the variable is not called HOME, so as not to clobber the shell's own $HOME)
HOMEFS=`df -hl /home | tail -1 | awk '{ print $5 }' | cut -d '%' -f 1`
case $HOMEFS in
    ''|*[!0-9]*) HOMEFS=`df -hl /home | tail -1 | awk '{ print $4 }' | cut -d '%' -f 1` ;;
esac

if [ "$HOMEFS" -gt "$TOP" ]; then
        TOP=$HOMEFS
        WORST="/home"
fi


# check /tmp partition
TMP=`df -hl /tmp | tail -1 | awk '{ print $5 }' | cut -d '%' -f 1`
case $TMP in
    ''|*[!0-9]*) TMP=`df -hl /tmp | tail -1 | awk '{ print $4 }' | cut -d '%' -f 1` ;;
esac

if [ "$TMP" -gt "$TOP" ]; then
        TOP=$TMP
        WORST="/tmp"
fi


# check /opt partition
OPT=`df -hl /opt | tail -1 | awk '{ print $5 }' | cut -d '%' -f 1`
case $OPT in
    ''|*[!0-9]*) OPT=`df -hl /opt | tail -1 | awk '{ print $4 }' | cut -d '%' -f 1` ;;
esac

if [ "$OPT" -gt "$TOP" ]; then
        TOP=$OPT
        WORST="/opt"
fi


#echo "TOP = $TOP"

if [ "$TOP" -ge 70 ] && [ "$TOP" -lt 80 ]; then
        echo "Worst partition: ($WORST) at $TOP"
        exit 7
fi


if [ "$TOP" -ge 80 ] && [ "$TOP" -lt 90 ]; then
        echo "Worst partition: ($WORST) at $TOP"
        exit 8
fi

if [ "$TOP" -ge 90 ] && [ "$TOP" -lt 95 ]; then
        echo "Worst partition: ($WORST) at $TOP"
        exit 9
fi

if [ "$TOP" -ge 95 ] && [ "$TOP" -lt 98 ]; then
	echo "Worst partition: ($WORST) at $TOP"
        exit 10
fi

if [ "$TOP" -eq 98 ]; then
	echo "Worst partition: ($WORST) at $TOP"
        exit 11
fi

if [ "$TOP" -eq 99 ]; then
	echo "Worst partition: ($WORST) at $TOP"
        exit 12
fi


if [ "$TOP" -gt 99 ]; then
	echo "Worst partition: ($WORST) at $TOP :("
        exit 20
fi

echo "Everything OK: worst partition: ($WORST) at $TOP"
exit 0
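For the record, the “some kind of function” version I was too lazy to write might look something like the sketch below. It isn’t what runs on our servers; the function names are made up, and it assumes each band starts at its lower bound (so exactly 80% maps to exit code 8). Using df -P keeps every filesystem on a single line, which should make the “try column 4, then column 3” fallbacks unnecessary.

```shell
#!/bin/bash

# Percentage used on the filesystem holding $1. "df -P" keeps each filesystem
# on a single line, so the value is always in column 5.
usage_pct() {
    df -P -l "$1" 2>/dev/null | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Map a usage percentage to the exit codes used above.
severity() {
    local pct=$1
    if   [ "$pct" -gt 99 ]; then echo 20
    elif [ "$pct" -eq 99 ]; then echo 12
    elif [ "$pct" -eq 98 ]; then echo 11
    elif [ "$pct" -ge 95 ]; then echo 10
    elif [ "$pct" -ge 90 ]; then echo 9
    elif [ "$pct" -ge 80 ]; then echo 8
    elif [ "$pct" -ge 70 ]; then echo 7
    else echo 0
    fi
}

# Print "<mount point> <percentage>" for the fullest of the given paths.
worst_partition() {
    local top=0 worst=/ fs pct
    for fs in "$@"; do
        pct=$(usage_pct "$fs")
        case $pct in ''|*[!0-9]*) continue ;; esac   # skip anything non-numeric
        if [ "$pct" -gt "$top" ]; then top=$pct; worst=$fs; fi
    done
    echo "$worst $top"
}

# Drop /home from the list if you don't want to monitor it.
read -r worst top <<< "$(worst_partition / /var /home /tmp /opt)"
code=$(severity "$top")
echo "Worst partition: ($worst) at $top (exit code $code)"
# As a standalone script, finish with: exit "$code"
```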

The goal is to run it on every server you’re responsible for. We use an HP automation tool where I work, but it could be done in many ways, including with a central machine that could ssh “passwordless-ly” into all other servers; it wouldn’t need root access for that, since it doesn’t change anything anywhere. Afterwards, you’re supposed to sort the output by exit codes; anything other than 0 might indicate a problem, and the higher, the worse it is; exit code 20 means a partition at 100%. Again, that HP tool does that sorting easily, but it’s far from being the only way; if you’re reading this, you’ve probably already thought of some.
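If you don’t have such a tool, the ssh route could be sketched roughly as follows. The file names (check_disk.sh for the script above, hosts.txt with one host name per line) and the function names are hypothetical, and it assumes key-based ssh logins already work. Hosts that can’t be reached end up at the top too, since ssh itself exits with 255.

```shell
#!/bin/bash

# Run the check on one host over ssh; the script is fed to a remote bash on stdin.
# check_disk.sh is a hypothetical file name for the script above.
run_check() {
    ssh -o BatchMode=yes "$1" 'bash -s' < ./check_disk.sh
}

# Read host names on stdin; print "<exit code> <host> <message>" for each.
collect() {
    local host out code
    while read -r host; do
        [ -n "$host" ] || continue
        code=0
        out=$(run_check "$host") || code=$?
        echo "$code $host $out"
    done
}

# hosts.txt (hypothetical): one host name per line. Worst cases come first.
if [ -f hosts.txt ]; then
    collect < hosts.txt | sort -rn
fi
```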

And then, naturally, you’re supposed to go through the worst cases, either by deleting obvious trash, or by contacting the application teams so that they clean up their stuff.

Depending on your company’s policies, you may want to comment out the block that refers to the /home partition, if that is completely the responsibility of the users.

I don’t have statistics, of course, but I can tell you that being called because of a(n almost) full filesystem, which was relatively frequent a year or so ago, is now extremely rare. It only happens when, after work hours, an application goes really wrong (typically because of an untested settings change) and starts dumping error messages/debug logs as fast as it can. If only hardware problems didn’t exist, either…

Linux: create a Volume Group with all newly added disks

Let’s say you’ve just added one or more disk drives to a (physical or virtual) Linux system, and you know you want to create a volume group named “vgdata” with all of them — or add them to that VG if it already exists.

For extra fun, let’s also say you want to do it to a lot of systems at the same time, and they’re a heterogeneous bunch — some of them may have the “vgdata” VG already, while some don’t; some of them may have had just one new disk added to it, while others got several. How to script it?

#!/bin/bash

# create full-size LVM partitions on all drives with no partitions yet; also create PVs for them
for i in b c d e f g h i j k l m n o p q r s t u v w x y z; do
    # skip drive letters with no disk behind them
    sfdisk -s /dev/sd$i >/dev/null 2>&1 || continue
    # skip disks that already have a first partition
    sfdisk -s /dev/sd${i}1 >/dev/null 2>&1 && continue
    parted /dev/sd$i mklabel msdos \
        && parted -a optimal /dev/sd$i mkpart primary ext4 "0%" "100%" \
        && parted -s /dev/sd$i set 1 lvm on \
        && pvcreate /dev/sd${i}1
done

# if the "vgdata" VG exists, extend it with all unused PVs...
vgs | grep -q vgdata && pvs --no-headings -o pv_name -S vg_name="" | sed 's/^ *//g' | xargs vgextend vgdata

# ... otherwise, create it with those same PVs
vgs | grep -q vgdata || pvs --no-headings -o pv_name -S vg_name="" | sed 's/^ *//g' | xargs vgcreate vgdata
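Before letting this loose on a lot of machines, it may be worth doing a dry run of the extend-or-create decision. The sketch below only prints the command it would run; vg_exists and plan_vg are made-up names. It also covers two corner cases: a server with no unused PVs at all, and a VG whose name merely contains “vgdata”.

```shell
#!/bin/bash

# True if the named VG exists. Asking vgs for the VG by name avoids a substring
# match on, say, "vgdata2".
vg_exists() {
    vgs "$1" >/dev/null 2>&1
}

# plan_vg <vg name> [pv ...]: print the command that would be run.
plan_vg() {
    local vg=$1
    shift
    if [ $# -eq 0 ]; then
        echo "# no unused PVs found, nothing to do for $vg"
    elif vg_exists "$vg"; then
        echo "vgextend $vg $*"
    else
        echo "vgcreate $vg $*"
    fi
}

# Unquoted on purpose: pvs prints one PV per line and we want one argument each.
plan_vg vgdata $(pvs --noheadings -o pv_name -S 'vg_name=""' 2>/dev/null)
```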

As always, you can use your company’s automation system to run it on a bunch of servers, or use pssh, or a bash “for” loop, or…