Migrating a live server to another host with no downtime

I have had a 1U server co-located for some time now at iWeb Technologies' datacenter in Montreal. So far I've had no issues and it did a wonderful job hosting websites & a few other VMs, but because of my concern for its aging hardware I wanted to migrate away before disaster struck.

Modern VPS offerings are a steal in terms of they performance they offer for the price, and Linode's 4096 plan caught my eye at a nice sweet spot. Backed by powerful CPUs and SSD storage, their VPS is blazingly fast and the only downside is I would lose some RAM and HDD-backed storage compared to my 1U server. The bandwidth provided wit the Linode was also a nice bump up from my previous 10Mbps, 500GB/mo traffic limit.

When CentOS 7 was released I took the opportunity to immediately start modernizing my CentOS 5 configuration and test its configuration. I wanted to ensure full continuity for client-facing services - other than a nice speed boost, I wanted clients to take no manual action on their end to reconfigure their devices or domains.

I also wanted to ensure zero downtime. As the DNS A records are being migrated, I didn't want emails coming in to the wrong server (or clients checking a stale inboxes until they started seeing the new mailserver IP). I can easily configure Postfix to relay all incoming mail on the CentOS 5 server to the IP of the CentOS 7 one to avoid any loss of emails, but there's still the issue that some end users might connect to the old server and get served their old IMAP inbox for some time.

So first things first, after developing a prototype VM that offered the same service set I went about buying a small Linode for a month to test the configuration some of my existing user data from my CentOS 5 server. MySQL was sufficiently easy to migrate over and Dovecot was able to preserve all UUIDs, so my inbox continued to sync seamlessly. Apache complained a bit when importing my virtual host configurations due to the new 2.4 syntax, but nothing a few sed commands couldn't fix. So with full continuity out of the way, I had to develop a strategy to handle zero downtime.

With some foresight and DNS TTL adjustments, we can get near zero downtime assuming all resolvers comply with your TTL. Simply set your TTL to 300 (5 minutes) a day or so before the migration occurs and as your old TTL expires, resolvers will see the new TTL and will not cache the IP for as long. Even with a short TTL, that's still up to 5 minutes of downtime and clients often do bad things... The IP might still be cached (e.g. at the ISP, router, OS, or browser) for longer. Ultimately, I'm the one that ends up looking bad in that scenario even though I have done what I can on the server side and have no ability to fix the broken clients.

To work around this, I discovered an incredibly handy tool socat that can make magic happen. socat routes data between sockets, network connections, files, pipes, you name it. Installing it is as easy as: yum install socat

A quick script later and we can forward all connections from the old host to the new host:

#!/bin/sh
NEWIP=0.0.0.0

# Stop services on this host
for SERVICE in dovecot postfix httpd mysqld;do
  /sbin/service $SERVICE stop
done

# Some cleanup
rm /var/lib/mysql/mysql.sock

# Map the new server's MySQL to localhost:3307
# Assumes capability for password-less (e.g. pubkey) login
ssh $NEWIP -L 3307:localhost:3306 &
socat unix-listen:/var/lib/mysql/mysql.sock,fork,reuseaddr,unlink-early,unlink-close,user=mysql,group=mysql,mode=777 TCP:localhost:3307 &

# Map ports from each service to the new host
for PORT in 110 995 143 993 25 465 587 80 3306;do
  echo "Starting socat on port $PORT..."
  socat TCP-LISTEN:$PORT,fork TCP:${NEWIP}:${PORT} &
  sleep 1
done

And just like that, every connection made to the old server is immediately forwarded to the new one. This includes the MySQL socket (which is automatically used instead of a TCP connection a host of 'localhost' is passed to MySQL).

Note how we establish a SSH tunnel mapping a connection to localhost:3306 on the new server to port 3307 on the old one instead of simply forwarding the connection and socket to the new server - this is done so that if you have users who are permitted on 'localhost' only, they can still connect (forwarding the connection will deny access due to a connection from a unauthorized remote host).

Update: a friend has pointed out this video to me, if you thought 0 downtime was bad enough... These guys move a live server 7km through public transport without losing power or network!