redundant DNS on Ubuntu using IP takeover via heartbeat and pacemaker

When DNS is unavailable the internet always seems down. And sometimes it’s just really, really slow waiting for timeouts. The other day the primary DNS server in my house failed and I once again thought that there had to be a better way. Since I already have a DNS slave, I started hitting google land for ideas like anycast and load balancing. Lots of research later I found that I essentially had 3 choices:

  1. Set up resolv.conf to point at both servers using the rotate option and a short timeout (see the sketch after this list)
  2. Set up a fully redundant, load-balanced setup via LVS+Keepalived
  3. Set up an HA IP takeover system that will always make sure that DNS is answering on the primary IP
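
For reference, option 1 would have meant maintaining an /etc/resolv.conf like the following on every client (the nameserver IPs here are just examples, and note that the timeout option only takes whole seconds):

# /etc/resolv.conf on each client
nameserver 192.168.1.2
nameserver 192.168.1.3
options rotate timeout:1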

Based on the title of the post you can probably guess which way I went, but before I get to it I’ll share why I didn’t go the other two routes.

The timeout option seems attractive and simple at first. Unfortunately I use DHCP to configure most of my systems and there doesn’t seem to be a way to push those configuration options to the clients.

Now the keepalived option really appeals to the geek in me, but it ends up going deep into the network routing realm. That doesn’t scare me, but most of the docs I’ve found talk about running Linux boxes with multiple interfaces and segmented internal and external networks. Again, not that scary, but way too much effort for my small home network. There are ways it can be set up using LVS-NAT on a single network, but that has the downside that the DNS servers need to change their default route to go through the LVS system. My boxes do double duty and I back them up over the network, so I didn’t want to run all that backup traffic through the keepalived+LVS box.

So I was left with the IP takeover approach, and so far it seems to work splendidly. Here is what I did, based on the notes found at SETTING UP IP FAILOVER WITH HEARTBEAT AND PACEMAKER ON UBUNTU LUCID and the Debian Lenny HowTo from clusterlabs. The instructions are Ubuntu-centric but should be easily adaptable to other Linux distributions.

First install the needed software on both DNS servers:

sudo apt-get install heartbeat pacemaker

Next, set up the heartbeat HA configuration by creating the config file /etc/ha.d/ha.cf with the following contents:

autojoin none
bcast eth2
warntime 3
deadtime 6
initdead 60
keepalive 1
node black
node glades
crm respawn

Note: make sure you use the right interface on each DNS box for the bcast setting, and use the hostnames of the DNS servers in the node entries. In my case the primary is named black and the secondary glades.
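
If you’re unsure which interface name to use for bcast, list all the interfaces on each box:

ifconfig -a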

The next step is creating an authentication key for heartbeat.

sudo sh -c '(echo -ne "auth 1\n1 sha1 "; dd if=/dev/urandom bs=512 count=1 | openssl md5) > /etc/ha.d/authkeys'
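
The generated /etc/ha.d/authkeys should end up looking something like this (the digest below is just an example; depending on your openssl version the line may also carry a "(stdin)= " prefix, which is harmless as long as the file is identical on both nodes):

auth 1
1 sha1 0b0f646a95c34ea25c2e66326482a42b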

Then copy both /etc/ha.d/authkeys and /etc/ha.d/ha.cf to the other node, set the permissions on the /etc/ha.d/authkeys file on both nodes, and finally start the heartbeat service:

sudo scp /etc/ha.d/* glades:/etc/ha.d/
sudo chmod 0600 /etc/ha.d/authkeys
ssh glades sudo chmod 0600 /etc/ha.d/authkeys
sudo /etc/init.d/heartbeat start
ssh glades sudo /etc/init.d/heartbeat start

Note: Make sure the bcast setting is correct on the sister node after copying ha.cf over; in my case the two systems use different interfaces, so the copied file needed editing.

To check that all went well, try:

[~]# sudo crm status | grep -i online
Online: [ glades black ]

Next you’ll set up pacemaker by disabling STONITH, configuring the failover IP, setting the preferred server, and configuring the monitoring that makes the failover happen.

The crm command will be used again:

[~]# sudo crm configure
crm(live)configure# show
crm(live)configure# property stonith-enabled=false
crm(live)configure# primitive dns_ip IPaddr params ip="192.168.1.240" cidr_netmask="255.255.255.0"
crm(live)configure# location dns_ip_pref dns_ip 100: black
crm(live)configure# monitor dns_ip 40s:20s
crm(live)configure# primitive rndc_reload ocf:heartbeat:anything params binfile="/usr/sbin/rndc" cmdline_options="reload"
crm(live)configure# group VIP_rndc dns_ip rndc_reload meta target-role="Started"
crm(live)configure# commit
crm(live)configure# exit
bye

EDIT: Turns out the IP takeover was only part of the solution. As it happens, Bind9 doesn’t bind itself to 0.0.0.0 and instead binds to each specific IP. Thus when the IP is taken over, bind must be reloaded to start listening and answering on the new IP! The crm commands above have been augmented to include a call to rndc reload.
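
To confirm that bind really picked up the takeover address after a failover, check the listening sockets on the active node (netstat comes from the net-tools package):

sudo netstat -lnup | grep :53

You should see named bound to 192.168.1.240:53 on whichever node currently holds the address.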

Note: The IP address above is the one that will be used for the takeover; in other words, that’s where the DNS queries go.

Now you should be able to run DNS queries against the IP configured above.
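
For example, with dig from the dnsutils package (substitute a name your servers actually resolve for example.com):

dig @192.168.1.240 example.com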

You can force the IP address over to the secondary by running:

sudo crm resource migrate dns_ip

and then bring it back with:

sudo crm resource unmigrate dns_ip

If you run ifconfig on the nodes as you do this, you will see a virtual interface show up on the active node, listening on the virtual IP address.
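
On the active node the alias output looks roughly like this (interface name and hardware address will differ depending on which NIC holds the 192.168.1.0/24 address):

eth0:0    Link encap:Ethernet  HWaddr 00:11:22:33:44:55
          inet addr:192.168.1.240  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1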

There are other things you can do as well, like stopping the resource completely or putting a node in standby. For example:
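
sudo crm resource stop VIP_rndc    # stop the VIP and the rndc reload together
sudo crm resource start VIP_rndc   # start them again
sudo crm node standby glades       # take glades out of the rotation
sudo crm node online glades        # bring it back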

I’m pretty pleased at this point with the setup.

@matthias
