When DNS is unavailable the internet always seems down, and sometimes it's just painfully slow while everything waits for timeouts. The other day the primary DNS server in my house failed and I once again thought that there had to be a better way. Since I already have a DNS slave, I hit Google for ideas like anycast and load balancing. Lots of research later I found that I essentially had three choices:
- Set up resolv.conf to point at both servers using the timeout option
- Set up a fully redundant load-balanced setup via LVS+Keepalived
- Set up an HA IP-takeover system that always makes sure DNS is answering on the primary IP
Based on the title of the post you can probably guess which way I went, but before I get to it I'll share why I didn't go the other two routes.
The resolv.conf timeout option seems attractive and simple at first. Unfortunately I use DHCP to configure most of my systems, and there doesn't seem to be a way to push those resolver options to the clients.
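For illustration, the resolv.conf approach would look something like this (the addresses are examples, not from my network):

```
nameserver 192.168.1.2
nameserver 192.168.1.3
options timeout:1 attempts:2
```

With timeout:1, a client falls over to the second nameserver after one second instead of the default five, which softens the pain but still leaves every new resolver process paying that penalty while the primary is down.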
Now the keepalived option really appeals to the geek in me, but it goes deep into network-routing territory. That doesn't scare me, but most of the docs I found assume Linux boxes with multiple interfaces and segmented internal and external networks. Again, not that scary, but way too much effort for my small home network. There are ways to set it up on a single network using LVS-NAT, but that has the downside that the DNS servers need to change their default route to go through the LVS system. These boxes do double duty as backup targets, and I didn't want to run all that backup traffic through the keepalived+LVS box.
So I was left with the IP-takeover approach, and so far it works splendidly. Here is what I did, based on the notes found at SETTING UP IP FAILOVER WITH HEARTBEAT AND PACEMAKER ON UBUNTU LUCID and the Debian Lenny HowTo from clusterlabs. The instructions are Ubuntu-centric but should be easily adaptable to other Linux distributions.
First install the needed software on both DNS servers:
sudo apt-get install heartbeat pacemaker
Next, set up the heartbeat HA configuration by creating the config file /etc/ha.d/ha.cf with the following contents:
autojoin none
bcast eth2
warntime 3
deadtime 6
initdead 60
keepalive 1
node black
node glades
crm respawn
Note: make sure you use the right interface on each DNS box in the bcast setting, and use the hostnames of the DNS servers in the node settings. In my case the primary is named black and the secondary glades.
The next step is creating an authentication key for heartbeat:
sudo sh -c '(echo -ne "auth 1\n1 sha1 "; dd if=/dev/urandom bs=512 count=1 | openssl md5 ) > /etc/ha.d/authkeys'
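The resulting /etc/ha.d/authkeys should look roughly like this (the digest will differ every time, since it comes from /dev/urandom):

```
auth 1
1 sha1 <32-character hex digest>
```

The file just tells heartbeat to authenticate its peer traffic with key number 1 using SHA-1 over that shared secret, which is why it must be identical, and readable only by root, on both nodes.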
Then copy both /etc/ha.d/authkeys and /etc/ha.d/ha.cf to the other node, set the permissions on the /etc/ha.d/authkeys file on both nodes, and finally start the heartbeat service:
sudo scp /etc/ha.d/* glades:/etc/ha.d/
sudo chmod 0600 /etc/ha.d/authkeys
ssh glades sudo chmod 0600 /etc/ha.d/authkeys
sudo /etc/init.d/heartbeat start
ssh glades sudo /etc/init.d/heartbeat start
Note: make sure the bcast setting is correct on the sister node; in my case the two systems use different interfaces.
To check that all went well, try:
[~]# sudo crm status | grep -i online
Online: [ glades black ]
Next you'll set up pacemaker: disable STONITH, configure the failover IP, set the preferred server, and configure the monitoring that makes the failover happen. The crm command is used again:
[~]# sudo crm configure
crm(live)configure# show
crm(live)configure# property stonith-enabled=false
crm(live)configure# primitive dns_ip IPaddr params ip="192.168.1.240" cidr_netmask="255.255.255.0"
crm(live)configure# location dns_ip_pref dns_ip 100: black
crm(live)configure# monitor dns_ip 40s:20s
crm(live)configure# primitive rndc_reload ocf:heartbeat:anything params binfile="/usr/sbin/rndc" cmdline_options="reload"
crm(live)configure# group VIP_rndc dns_ip rndc_reload meta target-role="Started"
crm(live)configure# commit
crm(live)configure# exit
bye
EDIT: It turns out the IP takeover was only part of the solution. BIND9 doesn't bind itself to *0.0.0.0*; instead it binds to specific IPs. Thus when the IP is taken over, bind must be reloaded to start listening and answering on the new IP! The crm commands above are augmented to include a call to rndc reload (the rndc_reload primitive) whenever the address moves.
Note: the IP address above is the one that will be used for the takeover; in other words, that's where the DNS queries go.
Now you should be able to run DNS queries against the IP configured above.
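A quick sanity check from any client looks like this (www.example.com is just a placeholder for a name your server can resolve):

```
[~]# dig @192.168.1.240 +short www.example.com
```

If the query answers with the primary down and then again after a failover, the takeover is doing its job.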
You can force the IP address to the secondary by running
sudo crm resource migrate dns_ip
and then bring it back with:
sudo crm resource unmigrate dns_ip
If you run ifconfig on the nodes as you do this, you will see a virtual interface show up on the active node, listening on the virtual IP address.
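On the active node that alias looks something like this (the interface name and MAC address are illustrative; yours will match the physical interface carrying the VIP):

```
eth2:0    Link encap:Ethernet  HWaddr 00:11:22:33:44:55
          inet addr:192.168.1.240  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
```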
There are other things you can do as well, like stopping the resource completely, putting a node in standby etc.
I'm pretty pleased at this point with the setup.