Advanced Server Monitoring with Riemann and Graphite

My current server monitoring setup is documented in my CentOS 5 server tutorials. It consists of Nagios for service monitoring and Cacti for graphing of metrics including system load, network and disk space.

Both tools are very commonly used and lots of resources are available on their setup & configuration, but I never kicked the feeling that they were plain clunky. Over the past several months, I have performed several research and evaluated a variety of tools and thankfully came across the monitoring sucks effort which aims to document a bunch of blog posts on monitoring tools and their different merits and weaknesses. The collection of all documentation the is now kept in the monitoring sucks GitHub repo.

Long story short, each tool seems to only do part of the job. I hate redundancy, and I believe that a good monitoring system would:

provide an overview of the current service status;
notify you appropriately and timely when things go wrong; and
provide a historical overview of data to establish some sort of baseline / normal level for collected metrics (i.e graphs and 99-percentiles)
ideally, be able to react proactively when things go wrong

You'll find that most tools will do two of four above well, which is just enough to be annoyingly useful. You'll need to implement 2-3 overlapping tools that do one thing well and the other just okay. Well, I don't like to live with workarounds.

Choosing the right tool for the job

I did a bit of research and solicited some advice on r/sysadmin, but sadly it did not get enough upvotes to be very noticed. Collectd looked like a wonderful utility. It is simple, high-performance and focused on doing one thing well. It was trivial to get it writing tons of system metrics to RRD files, at which point Visage provided a smooth user interface. Although it was a step in the right direction as far as what I was looking for, it still only did two of the four items above.

Introducing Riemann

Then, I stumbled across Riemann through his Monitorama 2013 presentation. Although not the easiest to configure and its notification support is a bit lacking, it has several features that immediately piqued my interest:

Its architecture forgoes the traditional polling and instead processes arbitrary event streams.
- Events can contain data (the metric) as well as other information (hostname, service, state, timestamp, tags, ttl)
- Events can be filtered by their attributes and transformed (percentiles, rolling averages, etc)
- Monitoring up new machines is as easy as pushing to your Riemann server from the new host
- Embed a Riemann client into your application or web service and easily add application level metrics
- Let collectd do what it does best and have it shove the machine's health metrics to Riemann as an event stream
It is built for scale, and can handle thousands of events per second
Bindings (clients) are available in multitudes of languages
Has (somewhat primitive) support for notifications and reacting to service failures, but Riemann is extensible so you can add what you need
An awesome, configurable dashboard

All of this is described more adequately and in greater detail on its homepage. So how do you get it?

Installing Riemann

This assumes you are running CentOS 6 or more better (e.g. recent version of Fedora). In the case of CentOS, it also assumes that you have installed the EPEL repository.

yum install ruby rubygems jre-1.6.0
gem install riemann-tools daemonize
rpm -Uhv http://aphyr.com/riemann/riemann-0.2.4-1.noarch.rpm
chkconfig riemann on
service riemann start

Be sure to open ports 5555 (both TCP and UDP), 5556 (TCP) and in your firewall. Riemann will uses 5555 for event submission, 5556 for a WebSockets connection to the server.

Riemann is now ready to go and accept events. You can modify your configuration at /etc/riemann/riemann.config as required - here is a sample from my test installation:

; -*- mode: clojure; -*-
; vim: filetype=clojure
(logging/init :file "/var/log/riemann/riemann.log")
; Listen on the local interface over TCP (5555), UDP (5555), and websockets (5556)
(let [host "my.hostname.tld"]
  (tcp-server :host host)
  (udp-server :host host)
  (ws-server  :host host))
; Expire old events from the index.
(periodically-expire 5)
; Custom stuffs
; Graphite server - connection pool
(def graph (graphite {:host "localhost"}))
; Email handler
(def email (mailer {:from "riemann@my.hostname.tld"}))
; Keep events in the index for 5 minutes by default.
(let [index (default :ttl 300 (update-index (index)))]
  ; Inbound events will be passed to these streams:
  (streams
    (where (tagged "rollingavg")
      (rate 5
        (percentiles 15 [0.5 0.95 0.99] index)
        index graph
      )
      (else
        index graph
      )
    )
    ; Calculate an overall rate of events.
    (with {:metric 1 :host nil :state "ok" :service "events/sec" :ttl 5}
      (rate 5 index))

; Log expired events. (expired (fn [event] (info "expired" event))) ))

The default configuration was modified here to do a few things differently:

Expire old events after only 5 seconds
Automatically calculate percentiles for events tagged with rollingavg
Send all event data to Graphite for graphing and archival
Set an email handler that, with some minor changes, could be used to send service state change notifications

Installing Graphite

Graphite can take data processed by Riemann and store it long-term, while also giving you tons of neat graphs.

yum --enablerepo=epel-testing install python-carbon python-whisper graphite-web httpd

We now need to edit /etc/carbon/storage-schemas.conf to tweak the time density of retained metrics. Since Riemann supports processing events quickly, I like to retain events at a higher precision than the default settings:

# Schema definitions for Whisper files. Entries are scanned in order,
# and first match wins. This file is scanned for changes every 60 seconds.
#
#  [name]
#  pattern = regex
#  retentions = timePerPoint:timeToStore, timePerPoint:timeToStore, ...
# Carbon's internal metrics. This entry should match what is specified in
# CARBON_METRIC_PREFIX and CARBON_METRIC_INTERVAL settings
[carbon]
pattern = ^carbon\.
retentions = 60:90d
#[default_1min_for_1day]
#pattern = .*
#retentions = 60s:1d

[primary] pattern = .* retentions = 10s:1h, 1m:7d, 15m:30d, 1h:2y

After making your changes, start the carbon-cache service:

service carbon-cache start
chkconfig carbon-cache on
touch /etc/carbon/storage-aggregation.conf

Now that Graphite's storage backend, Carbon, is running, we need to start Graphite:

python /usr/lib/python2.6/site-packages/graphite/manage.py syncdb
chown apache:apache /var/lib/graphite-web/graphite.db
service httpd graceful

Graphite should now be available on http://localhost - if this is undesirable, edit /etc/httpd/conf.d/graphite-web.conf and map it to a different hostname / URL according to your needs.

Note: as of writing, there's a bug in the version of python-carbon shipped with EL6 that complains incessantly to your logs if the storage-aggregation.conf configuration file doesn't exist. Let's create it to avoid a hundred-megabyte log file:

touch /etc/carbon/storage-aggregation.conf

But what about EL5

I am not going to detail how to install the full Riemann server on EL5, as the dependencies are far behind and it would require quite a bit of work. However, it is possible to install riemann-tools on RHEL/CentOS 5 for monitoring the machine with minimal work.

The rieman-health initscript requires the 'daemonize' command, install it via yum (EL6) or obtain it for EL5 here: http://pkgs.repoforge.org/daemonize/

The riemann-tools ruby gem and its dependencies will require a few development packages in order to build, as well as Karan's repo providing an updated ruby-1.8.7:

cat << EOF >> /etc/yum.repos.d/karan-ruby.repo
[kbs-el5-rb187]
name=kbs-el5-rb187
enabled=1
baseurl=http://centos.karan.org/el\$releasever/ruby187/\$basearch/
gpgcheck=1
gpgkey=http://centos.karan.org/RPM-GPG-KEY-karan.org.txt
EOF
yum update ruby\*
yum install ruby-devel libxml2-devel libxslt-devel libgcrypt-devel libgpg-error-devel
gem install riemann-tools --no-ri --no-rdoc