Network and data centre improvements
Over the last few weeks we’ve been working away behind the scenes in our data centres to make Clockwork even more reliable. The best part is that, because we run a complete copy of our systems in each site, we were able to do the work while keeping all our services running as normal.
New network switches
We were running out of network ports in Leeds, so rather than just adding another switch we took the chance to improve reliability too. Gone is our pair of 24-port switches, and in their place is a stack of 48-port HP switches. These new stackable switches support link aggregation, which allows us to replace the single cable linking each server to a switch with a pair - each to a different switch. With pairs of cables in place we’ve got faster network communication and massively improved reliability: somebody can now unplug the wrong cable by mistake, accidentally damage a cable or perform maintenance, and the only thing that’ll notice is our monitoring server.
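As a rough illustration of what link aggregation looks like on the server side, here is a sketch of bonding two interfaces on a Linux host using iproute2. The interface names and address are assumptions for the example, not our real values, and the matching switch ports have to be configured for LACP (802.3ad) too - with a cross-switch bond like ours, that relies on the stacked switches presenting themselves as a single logical switch:

```
# Illustrative sketch only - interface names and the IP address are
# placeholders, and the switch side must also be set up for LACP.
ip link add bond0 type bond mode 802.3ad   # create an 802.3ad (LACP) bond
ip link set eth0 down
ip link set eth1 down
ip link set eth0 master bond0              # first cable, to switch 1
ip link set eth1 master bond0              # second cable, to switch 2
ip link set bond0 up
ip addr add 192.0.2.11/24 dev bond0        # the server's single logical address
```

With this in place either cable can be unplugged and traffic simply carries on over the remaining link.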
Intelligent load balancing
Clockwork runs on four web servers, a pair in each data centre, and we load balance traffic across each pair. Previously this was done with Windows Network Load Balancing; however, it isn’t all that intelligent and just checks that a server is responding, not what it’s responding with. So a server could be returning nothing but internal server errors and Network Load Balancing would still send it traffic - obviously not an ideal situation, though still much better than a single server. We’ve now replaced it with a combination of Keepalived and HA Proxy.
Keepalived has one very simple job: it makes sure the Clockwork IP addresses always point to a server with a working instance of HA Proxy. It does this using the Virtual Router Redundancy Protocol (VRRP). Under VRRP one server is always live (the master) while the other sits in standby (the backup); if the master fails, the backup takes over its role within seconds, keeping services running.
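To give a flavour of how simple this is, here is a minimal keepalived.conf sketch along those lines. The interface name, priorities, password and address are illustrative assumptions, not our production values:

```
# Hypothetical sketch - values are placeholders, not our real config.
vrrp_script check_haproxy {
    script "killall -0 haproxy"   # succeeds only while an haproxy process exists
    interval 2                    # run the check every 2 seconds
}

vrrp_instance VI_1 {
    state MASTER                  # the standby node uses "state BACKUP"
    interface eth0
    virtual_router_id 51
    priority 101                  # the standby node gets a lower priority
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass example
    }
    virtual_ipaddress {
        192.0.2.10                # the shared service IP that clients hit
    }
    track_script {
        check_haproxy             # demote this node if HA Proxy stops running
    }
}
```

The track_script section is what ties the two together: if HA Proxy dies on the master, Keepalived lowers its priority and the backup claims the service IP within seconds.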
HA Proxy is designed to distribute incoming web traffic across a number of web servers, and it’s used by some big names, including Twitter and Stack Overflow. Its big advantage over Network Load Balancing is that it can actively check the health of those web servers before sending them traffic. If a server starts returning errors it’s taken out of use, and you’ll only ever hit the remaining healthy servers.
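Again for illustration, a minimal haproxy.cfg showing that active health checking might look something like this; the server names, addresses and /health path are assumptions for the example, not our real configuration:

```
# Hypothetical sketch - names, addresses and the check path are placeholders.
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend www
    bind *:80
    default_backend webservers

backend webservers
    balance roundrobin
    # Poll each server regularly; a server only stays in the pool while
    # it answers the check with a healthy status, so one returning
    # internal server errors fails the check and stops receiving traffic.
    option httpchk GET /health
    server web1 192.0.2.11:80 check inter 2s fall 3 rise 2
    server web2 192.0.2.12:80 check inter 2s fall 3 rise 2
```

The fall and rise counts mean a server is only removed after several consecutive failed checks, and only brought back after several consecutive passes, so a momentary blip doesn’t cause flapping.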
A general tidy up
Given that replacing our network switches involved removing a good chunk of our cabling, we took the chance to do some general housekeeping too. Every single cable in both data centres is now colour coded to match its use - red power cables go to core network devices, orange to single power supplies, and black and white ones are only used for redundant power supplies. All cables are correctly labelled at both ends, and we’ve shuffled a few servers around so that Derby and Leeds now have identical setups. Although these changes don’t make any difference at all to everyday use, they do reduce the chance of mistakes when we’re performing maintenance - every member of staff now knows not to touch a red power cable!
At the moment our HA Proxy servers are only load balancing traffic within their local data centre; our next step is to sort out firewall rules and tweak configurations to allow them to fall back to the opposite site. That would let us handle any major failure further up our text messaging stack completely transparently. We hope to get this configured, tested and live over the next few weeks.