server

Using Homebridge with cmdswitch2 to control your Linux machine over HomeKit

A few of my tech projects experience occasional hiccups and need to be soft-reset from my Linux host (e.g. a Wi-Fi SSID routed through a VPN, or a Windows gaming VM with hardware passthrough). This was annoying, as it meant having a machine nearby to SSH in and execute a few simple commands -- often just a systemctl restart foo. Fortunately, homebridge-cmdswitch2 can expose arbitrary commands as lights, so I can bounce the services straight from my phone.

First, since Homebridge should be running as its own system user, we need to give it permission to restart services (as root). We don't want to grant access to all of /bin/systemctl, so a wrapper script placed at /usr/local/bin/serviceswitch will encapsulate the desired behavior. Grant the homebridge user permission to run it with sudo:

cat << EOF > /etc/sudoers.d/homebridge-cmdswitch
homebridge ALL = (root) NOPASSWD: /usr/local/bin/serviceswitch
EOF
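
To double-check the grant, sudo can validate the fragment and list what the homebridge user is allowed to run (both are standard visudo/sudo options):

visudo -cf /etc/sudoers.d/homebridge-cmdswitch
sudo -l -U homebridge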

Next, let's create the /usr/local/bin/serviceswitch script to handle the start, stop and status actions for each service. Using a wrapper like this also has the benefit that complex checks consisting of several commands can be performed. Keep in mind these commands are now being run as root by Homebridge!

#!/bin/sh

if [ "$(id -u)" -ne 0 ];then
  echo "You must run this script as root."
  exit 1
fi

usage() {
  error="$1"
  if [ ! -z "$error" ];then
    echo "Error: $error"
  fi
  echo "Usage: $0 [action] [service]"
}

action="$1"
service="$2"
if [ -z "$action" ] || [ -z "$service" ];then
  usage
  exit 1
fi

case $action in
  start|stop|status) ;;
  *) usage "invalid action, must be one of [start, stop, status]"; exit 1;;
esac

case $service in
  vm-guests)
    [ "$action" = "start" ] && systemctl start libvirt-guests
    [ "$action" = "stop" ] && systemctl stop libvirt-guests
    [ "$action" = "status" ] && { systemctl -q is-active libvirt-guests; exit $?; }
    ;;
  fileserver)
    [ "$action" = "start" ] && (systemctl start smb; systemctl start nmb; systemctl start netatalk)
    [ "$action" = "stop" ] && (systemctl stop smb; systemctl stop nmb; systemctl stop netatalk)
    [ "$action" = "status" ] && { (systemctl -q is-active smb && systemctl -q is-active nmb && systemctl -q is-active netatalk); exit $?; }
    ;;
  web)
    [ "$action" = "start" ] && systemctl start httpd
    [ "$action" = "stop" ] && systemctl stop httpd
    [ "$action" = "status" ] && { systemctl -q is-active httpd; exit $?; }
    ;;
  *) usage "invalid service"; exit 1;;
esac
exit 0
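
A quick manual test verifies both the wrapper and the sudo rule before wiring anything into Homebridge. Run these as root; the service names are the ones defined in the script:

sudo -u homebridge sudo /usr/local/bin/serviceswitch status web; echo "exit code: $?"
sudo -u homebridge sudo /usr/local/bin/serviceswitch start vm-guests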

Finally, here is the relevant platform section from the homebridge config:

{
  "platforms": [{
    "platform": "cmdSwitch2",
    "name": "Command Switch",
    "switches": [{
       "name" : "vm-guests",
        "on_cmd": "sudo /usr/local/bin/serviceswitch start vm-guests",
        "off_cmd": "sudo /usr/local/bin/serviceswitch stop vm-guests",
        "state_cmd": "sudo /usr/local/bin/serviceswitch status vm-guests",
        "polling": false,
        "interval": 5,
        "timeout": 10000
    },
    {
       "name" : "fileserver",
        "on_cmd": "sudo /usr/local/bin/serviceswitch start fileserver",
        "off_cmd": "sudo /usr/local/bin/serviceswitch stop fileserver",
        "state_cmd": "sudo /usr/local/bin/serviceswitch status fileserver",
        "polling": false,
        "interval": 5,
        "timeout": 10000
    },
    {
       "name" : "web",
        "on_cmd": "sudo /usr/local/bin/serviceswitch start web",
        "off_cmd": "sudo /usr/local/bin/serviceswitch stop web",
        "state_cmd": "sudo /usr/local/bin/serviceswitch status web",
        "polling": false,
        "interval": 5,
        "timeout": 10000
    }]
  }]
}
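
After updating the config, restart Homebridge so the new switches appear in HomeKit. Assuming your instance runs under systemd as a unit named homebridge (adjust for however you manage it), the second command tails the log so you can confirm the switches register:

systemctl restart homebridge
journalctl -u homebridge -f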

Home server with Docker containers via linuxserver.io

After a few years of meticulously maintaining a large shell script that set up my Fedora home server, I finally got around to containerizing a good portion of it, thanks to the fine team at linuxserver.io.

As the software set I maintained grew, there were a few challenges with dependencies, and I ended up having to install or compile a few software titles myself -- something I generally try to avoid at all costs, since it puts me on the hook for regularly checking for security updates, addressing compatibility issues with OS upgrades, and so on.

After getting my docker-compose file right, it's been wonderful: a simple docker-compose pull updates everything, and a small systemd service brings the containers up at boot. Mapped volumes keep all of the data on the host rather than inside the containers, and I can use host networking mode for images that need auto-discovery (e.g. Plex or SMB).

Plus, since docker-compose is driven by a systemd service, I can make it depend on zfs-keyvault to ensure that any dependent filesystems are mounted and available. Hurray!
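
For reference, here is a minimal sketch of such a unit, written as a heredoc like the other snippets here. The unit name, compose directory, docker-compose path and the zfs-keyvault unit name are assumptions to adapt to your own setup:

cat << EOF > /etc/systemd/system/home-containers.service
[Unit]
Description=Home server containers (docker-compose)
Requires=docker.service
After=docker.service zfs-keyvault.service

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/containers
ExecStart=/usr/bin/docker-compose up -d
ExecStop=/usr/bin/docker-compose down

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now home-containers.service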

You can check out a sample config for my setup in this GitHub gist.

Migrating a live server to another host with no downtime

I have had a 1U server co-located for some time now at iWeb Technologies' datacenter in Montreal. So far I've had no issues, and it has done a wonderful job hosting websites and a few other VMs, but concern over its aging hardware made me want to migrate away before disaster struck.

Modern VPS offerings are a steal in terms of the performance they offer for the price, and Linode's 4096 plan caught my eye as a nice sweet spot. Backed by powerful CPUs and SSD storage, their VPS is blazingly fast; the only downside is that I would lose some RAM and HDD-backed storage compared to my 1U server. The bandwidth provided with the Linode was also a nice bump up from my previous 10Mbps, 500GB/mo traffic limit.

When CentOS 7 was released, I took the opportunity to immediately start modernizing my CentOS 5 configuration and testing it on the new release. I wanted to ensure full continuity for client-facing services: other than a nice speed boost, clients should not have to take any manual action to reconfigure their devices or domains.

I also wanted to ensure zero downtime. While the DNS A records were being migrated, I didn't want emails coming in to the wrong server (or clients checking a stale inbox until they started seeing the new mailserver IP). I could easily configure Postfix on the CentOS 5 server to relay all incoming mail to the IP of the CentOS 7 one to avoid losing any email, but there's still the issue that some end users might connect to the old server and be served their old IMAP inbox for some time.

So first things first: after developing a prototype VM that offered the same service set, I went about buying a small Linode for a month to test the configuration with some of my existing user data from my CentOS 5 server. MySQL was easy enough to migrate over, and Dovecot was able to preserve all UUIDs, so my inbox continued to sync seamlessly. Apache complained a bit when importing my virtual host configurations due to the new 2.4 syntax, but nothing a few sed commands couldn't fix. With full continuity out of the way, I had to develop a strategy for zero downtime.

With some foresight and DNS TTL adjustments, we can get near-zero downtime, assuming all resolvers comply with your TTL. Simply set your TTL to 300 (5 minutes) a day or so before the migration; as the old TTL expires, resolvers will pick up the new one and stop caching the IP for as long. Even with a short TTL that's still up to 5 minutes of downtime, and clients often do bad things: the IP might still be cached (e.g. at the ISP, router, OS, or browser) for longer. Ultimately, I'm the one who ends up looking bad in that scenario, even though I've done what I can on the server side and have no way to fix the broken clients.
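
Before the cutover, it's worth confirming that the lowered TTL is actually being served. dig shows the TTL on each answer, both via your resolver and straight from the authoritative nameserver (example.com and ns1.example.com below are placeholders):

dig +noall +answer example.com A
dig +noall +answer example.com A @ns1.example.com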

To work around this, I discovered an incredibly handy tool called socat that can make magic happen. socat routes data between sockets, network connections, files, pipes, you name it. Installing it is as easy as: yum install socat

A quick script later and we can forward all connections from the old host to the new host:

#!/bin/sh
NEWIP=0.0.0.0

# Stop services on this host
for SERVICE in dovecot postfix httpd mysqld;do
  /sbin/service $SERVICE stop
done

# Some cleanup
rm /var/lib/mysql/mysql.sock

# Map the new server's MySQL to localhost:3307
# Assumes capability for password-less (e.g. pubkey) login
ssh -N $NEWIP -L 3307:localhost:3306 &
socat unix-listen:/var/lib/mysql/mysql.sock,fork,reuseaddr,unlink-early,unlink-close,user=mysql,group=mysql,mode=777 TCP:localhost:3307 &

# Map ports from each service to the new host
for PORT in 110 995 143 993 25 465 587 80 3306;do
  echo "Starting socat on port $PORT..."
  socat TCP-LISTEN:$PORT,fork TCP:${NEWIP}:${PORT} &
  sleep 1
done

And just like that, every connection made to the old server is immediately forwarded to the new one. This includes the MySQL socket (which is automatically used instead of a TCP connection when a host of 'localhost' is passed to MySQL).

Note how we establish an SSH tunnel mapping localhost:3306 on the new server to port 3307 on the old one, rather than simply forwarding the socket's traffic straight to the new server's IP. This way, users that are only permitted to connect from 'localhost' can still do so: their queries arrive at the new MySQL instance from localhost via the tunnel, whereas a direct forward would be denied as a connection from an unauthorized remote host.
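
It's also easy to confirm the forwards are in place from the old host before flipping DNS: netstat shows the listeners, and a quick connection to a forwarded port should come back with the new server's banner (these commands are what's available on EL5; adjust as needed):

netstat -tlnp | egrep 'socat|ssh'
nc -v 127.0.0.1 25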

Update: a friend pointed out this video to me. If you thought zero-downtime migrations were impressive, these guys move a live server 7km through public transport without losing power or network!

Advanced Server Monitoring with Riemann and Graphite

My current server monitoring setup is documented in my CentOS 5 server tutorials. It consists of Nagios for service monitoring and Cacti for graphing of metrics including system load, network and disk space.

Both tools are very commonly used and lots of resources are available on their setup and configuration, but I never shook the feeling that they were plain clunky. Over the past several months I did quite a bit of research, evaluated a variety of tools, and thankfully came across the "monitoring sucks" effort, which gathers blog posts on monitoring tools and their respective merits and weaknesses. All of that documentation is now kept in the monitoring sucks GitHub repo.

Long story short, each tool seems to only do part of the job. I hate redundancy, and I believe that a good monitoring system would:

  1. provide an overview of the current service status;
  2. notify you appropriately and promptly when things go wrong;
  3. provide a historical overview of data to establish some sort of baseline / normal level for collected metrics (i.e. graphs and 99th percentiles); and
  4. ideally, be able to react automatically when things go wrong.

You'll find that most tools do two of the four above well, which is just enough to be annoyingly useful. You end up running two or three overlapping tools, each doing one thing well and the rest just okay. Well, I don't like living with workarounds.

Choosing the right tool for the job

I did a bit of research and solicited some advice on r/sysadmin, but sadly the post didn't get enough upvotes to attract much attention. Collectd looked like a wonderful utility: simple, high-performance, and focused on doing one thing well. It was trivial to get it writing tons of system metrics to RRD files, at which point Visage provided a smooth user interface. Although it was a step in the right direction, it still only did two of the four items above.

Introducing Riemann

Then I stumbled across Riemann via its author's Monitorama 2013 presentation. Although it isn't the easiest to configure and its notification support is a bit lacking, it has several features that immediately piqued my interest:

  • Its architecture forgoes the traditional polling and instead processes arbitrary event streams.
    • Events can contain data (the metric) as well as other information (hostname, service, state, timestamp, tags, ttl)
    • Events can be filtered by their attributes and transformed (percentiles, rolling averages, etc)
    • Monitoring new machines is as easy as pushing to your Riemann server from the new host
    • Embed a Riemann client into your application or web service and easily add application level metrics
    • Let collectd do what it does best and have it shove the machine's health metrics to Riemann as an event stream
  • It is built for scale, and can handle thousands of events per second
  • Bindings (clients) are available in multitudes of languages
  • Has (somewhat primitive) support for notifications and reacting to service failures, but Riemann is extensible so you can add what you need
  • An awesome, configurable dashboard

All of this is described in greater detail on its homepage. So how do you get it?

Installing Riemann

This assumes you are running CentOS 6 or newer (e.g. a recent version of Fedora). In the case of CentOS, it also assumes that you have installed the EPEL repository.

yum install ruby rubygems jre-1.6.0
gem install riemann-tools daemonize
rpm -Uhv http://aphyr.com/riemann/riemann-0.2.4-1.noarch.rpm
chkconfig riemann on
service riemann start

Be sure to open ports 5555 (both TCP and UDP) and 5556 (TCP) in your firewall. Riemann uses 5555 for event submission and 5556 for WebSocket connections to the server.
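
On EL6 with the stock iptables service, that can be as simple as the following (insert the rules wherever fits your existing chain layout):

iptables -I INPUT -p tcp --dport 5555 -j ACCEPT
iptables -I INPUT -p udp --dport 5555 -j ACCEPT
iptables -I INPUT -p tcp --dport 5556 -j ACCEPT
service iptables save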

Riemann is now ready to go and accept events. You can modify your configuration at /etc/riemann/riemann.config as required - here is a sample from my test installation:

; -*- mode: clojure; -*-
; vim: filetype=clojure

(logging/init :file "/var/log/riemann/riemann.log")

; Listen on the local interface over TCP (5555), UDP (5555), and websockets (5556)
(let [host "my.hostname.tld"]
  (tcp-server :host host)
  (udp-server :host host)
  (ws-server  :host host))

; Expire old events from the index.
(periodically-expire 5)

; Custom stuffs

; Graphite server - connection pool
(def graph (graphite {:host "localhost"}))
; Email handler
(def email (mailer {:from "riemann@my.hostname.tld"}))

; Keep events in the index for 5 minutes by default.
(let [index (default :ttl 300 (update-index (index)))]

  ; Inbound events will be passed to these streams:
  (streams

    (where (tagged "rollingavg")
      (rate 5
        (percentiles 15 [0.5 0.95 0.99] index)
        index graph
      )
      (else
        index graph
      )
    )

    ; Calculate an overall rate of events.
    (with {:metric 1 :host nil :state "ok" :service "events/sec" :ttl 5}
      (rate 5 index))

    ; Log expired events.
    (expired
      (fn [event] (info "expired" event)))
))

The default configuration was modified here to do a few things differently:

  • Sweep events whose TTL has lapsed out of the index every 5 seconds
  • Automatically calculate percentiles for events tagged with rollingavg
  • Send all event data to Graphite for graphing and archival
  • Set an email handler that, with some minor changes, could be used to send service state change notifications
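
With the server running, the riemann-health client from the riemann-tools gem installed earlier makes for an easy smoke test: point it at the host configured above and events should start appearing in the index and on the dashboard (my.hostname.tld is the same placeholder used in the config):

riemann-health --host my.hostname.tld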

Installing Graphite

Graphite can take data processed by Riemann and store it long-term, while also giving you tons of neat graphs.

yum --enablerepo=epel-testing install python-carbon python-whisper graphite-web httpd

We now need to edit /etc/carbon/storage-schemas.conf to tweak the time density of retained metrics. Since Riemann can process events at a high rate, I like to retain metrics at a higher precision than the default settings:

# Schema definitions for Whisper files. Entries are scanned in order,
# and first match wins. This file is scanned for changes every 60 seconds.
#
#  [name]
#  pattern = regex
#  retentions = timePerPoint:timeToStore, timePerPoint:timeToStore, ...

# Carbon's internal metrics. This entry should match what is specified in
# CARBON_METRIC_PREFIX and CARBON_METRIC_INTERVAL settings
[carbon]
pattern = ^carbon\.
retentions = 60:90d

#[default_1min_for_1day]
#pattern = .*
#retentions = 60s:1d

[primary]
pattern = .*
retentions = 10s:1h, 1m:7d, 15m:30d, 1h:2y

After making your changes, start the carbon-cache service:

service carbon-cache start
chkconfig carbon-cache on

Now that Graphite's storage backend, Carbon, is running, we need to start Graphite:

python /usr/lib/python2.6/site-packages/graphite/manage.py syncdb
chown apache:apache /var/lib/graphite-web/graphite.db
service httpd graceful

Graphite should now be available on http://localhost - if this is undesirable, edit /etc/httpd/conf.d/graphite-web.conf and map it to a different hostname / URL according to your needs.
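
To verify the whole pipeline, you can also hand Carbon a test metric directly and look for it in the Graphite web UI's metric tree. This assumes nc is installed and Carbon's plaintext listener is on its default port (2003):

echo "test.smoke 42 $(date +%s)" | nc localhost 2003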

Note: as of writing, there's a bug in the version of python-carbon shipped with EL6 that complains incessantly to your logs if the storage-aggregation.conf configuration file doesn't exist. Let's create it to avoid a hundred-megabyte log file:

touch /etc/carbon/storage-aggregation.conf

But what about EL5?

I am not going to detail how to install the full Riemann server on EL5, as the dependencies are far behind and it would require quite a bit of work. However, it is possible to install riemann-tools on RHEL/CentOS 5 for monitoring the machine with minimal work.

The riemann-health initscript requires the 'daemonize' command; install it via yum (EL6) or obtain it for EL5 here: http://pkgs.repoforge.org/daemonize/

The riemann-tools ruby gem and its dependencies require a few development packages in order to build, as well as Karan's repo, which provides an updated ruby-1.8.7:

cat << EOF >> /etc/yum.repos.d/karan-ruby.repo
[kbs-el5-rb187]
name=kbs-el5-rb187
enabled=1
baseurl=http://centos.karan.org/el\$releasever/ruby187/\$basearch/
gpgcheck=1
gpgkey=http://centos.karan.org/RPM-GPG-KEY-karan.org.txt
EOF
yum update ruby\*
yum install ruby-devel libxml2-devel libxslt-devel libgcrypt-devel libgpg-error-devel
gem install riemann-tools --no-ri --no-rdoc
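
If you'd rather test without the initscript, daemonize can background riemann-health by hand with something along these lines (the pid file path and the Riemann hostname are assumptions; my.hostname.tld matches the server used earlier):

daemonize -p /var/run/riemann-health.pid $(which riemann-health) --host my.hostname.tld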

Building a home media server with ZFS and a gaming virtual machine

Work has kept me busy lately, so it's been a while since my last post... I have been doing lots of research and information gathering over the holiday break, and I'm happy to say that in the coming days I will be posting a new server setup guide, this time for a server capable of running redundant storage (ZFS RAIDZ2), sharing home media (Plex Media Server, SMB, AFP), and a full Windows 7 gaming rig, all simultaneously!

Windows runs in a virtual machine and is assigned its own real graphics card from the host's hardware using the brand-new VFIO PCI passthrough technique with the VGA quirks enabled. This does require a motherboard and CPU with support for IOMMU, more commonly known as VT-d or AMD-Vi.