When DNS is unavailable the internet always seems down. And sometimes it’s just really, really slow while waiting for timeouts. The other day the primary DNS server in my house failed and I once again thought that there had to be a better way. Since I already have a DNS slave I started hitting google land for ideas like anycast and load balancing. Lots of research later I found that I essentially had three choices:
- Set up resolv.conf to point at both servers using the rotate and timeout=0.5 options (see the sample resolv.conf after this list)
- Set up a fully redundant, load-balanced setup via LVS+Keepalived
- Set up an HA IP takeover system that will always make sure DNS is answering on the primary IP
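For reference, a minimal resolv.conf along those lines might look like the following. The nameserver addresses are placeholders for your two DNS servers; note that glibc uses the colon syntax for options and, as far as I know, only accepts whole seconds for the timeout value:

nameserver 192.168.1.10
nameserver 192.168.1.11
options rotate timeout:1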
Based on the title of the post you can probably guess which way I went, but before I get to it I’ll share why I didn’t go the other two routes.
The timeout option seems attractive and simple at first. Unfortunately I use DHCP to configure most of my systems and there doesn’t seem to be a way to push those configuration options to the clients.
Now the keepalived option really appeals to the geek in me, but it ends up going deep into the network routing realm. That doesn’t scare me, but most of the docs I’ve found talk about running Linux boxes with multiple interfaces and segmented internal and external networks. Again, not that scary, but way too much effort for my small home network. There are ways it can be set up using LVS-NAT on a single network, but that has the downside that the DNS servers need to change their default route to go through the LVS system. My DNS boxes do double duty as backup targets, and I didn’t want to run all that backup traffic through the keepalived+LVS box.
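For the curious, here is a rough sketch of what the LVS-NAT piece of a keepalived.conf might have looked like for DNS. The VIP matches the takeover address I use later, the real-server addresses are made up, and health checks are omitted for brevity. The catch with lb_kind NAT is that replies must flow back through the director, hence the default-route change mentioned above:

virtual_server 192.168.1.240 53 {
    delay_loop 10
    lb_algo rr
    lb_kind NAT
    protocol UDP

    real_server 192.168.1.10 53 {
        weight 1
    }
    real_server 192.168.1.11 53 {
        weight 1
    }
}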
So I was left with the IP takeover approach, and so far it seems to work splendidly. Here is what I did, based on the notes found at SETTING UP IP FAILOVER WITH HEARTBEAT AND PACEMAKER ON UBUNTU LUCID and the Debian Lenny HowTo from clusterlabs. The instructions are Ubuntu-centric but should be easily adaptable to other Linux distributions.
First install the needed software on both DNS servers:
sudo apt-get install heartbeat pacemaker
Next set up the heartbeat HA configuration by creating the config file /etc/ha.d/ha.cf with the following contents:
autojoin none
bcast eth2
warntime 3
deadtime 6
initdead 60
keepalive 1
node black
node glades
crm respawn
Note: make sure you use the right interface on each DNS box in the bcast setting, and use the hostnames of the DNS servers in the node settings. In my case the primary is named black and the secondary glades.
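If you’re not sure which interface name to put in the bcast line on a given box, a quick look at the interfaces and their addresses will tell you:

ifconfig -a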
The next step is creating an authentication key for heartbeat.
sudo sh -c '(echo -ne "auth 1\n1 sha1 "; dd if=/dev/urandom bs=512 count=1 | openssl md5 ) > /etc/ha.d/authkeys'
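The resulting /etc/ha.d/authkeys should end up looking roughly like the made-up example below, with your own random digest on the second line. Depending on your openssl version the digest may carry a "(stdin)= " prefix; as long as the file is identical on both nodes it will still match:

auth 1
1 sha1 2f7d8a1c9b3e4f60a5d1c8e9b2a4f6d3e7c0a1b5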
Then copy both /etc/ha.d/authkeys and /etc/ha.d/ha.cf to the other node, set the permissions on the /etc/ha.d/authkeys file on both nodes, and finally start the heartbeat service:
sudo scp /etc/ha.d/* glades:/etc/ha.d/
sudo chmod 0600 /etc/ha.d/authkeys
ssh glades sudo chmod 0600 /etc/ha.d/authkeys
sudo /etc/init.d/heartbeat start
ssh glades sudo /etc/init.d/heartbeat start
Note: Make sure the bcast setting is correct on the sister node. In my case the systems use different interfaces.
To check that all went well, try:
[~]# sudo crm status | grep -i online
Online: [ glades black ]
Next you’ll set up pacemaker: disable STONITH, configure the failover IP, set the preferred server, and configure the monitoring that makes the failover happen.
The crm command will be used again:
[~]# sudo crm configure
crm(live)configure# show
crm(live)configure# property stonith-enabled=false
crm(live)configure# primitive dns_ip IPaddr params ip="192.168.1.240" cidr_netmask="255.255.255.0"
crm(live)configure# location dns_ip_pref dns_ip 100: black
crm(live)configure# monitor dns_ip 40s:20s
crm(live)configure# primitive rndc_reload ocf:heartbeat:anything params binfile="/usr/sbin/rndc" cmdline_options="reload"
crm(live)configure# group VIP_rndc dns_ip rndc_reload meta target-role="Started"
crm(live)configure# commit
crm(live)configure# exit
bye
EDIT: Turns out the IP takeover was only part of the solution. It seems Bind9 doesn’t bind itself to 0.0.0.0 but instead binds to specific IPs, so when the IP is taken over bind must be reloaded to start listening and answering on the new IP! The crm commands above are augmented to include a call to rndc reload.
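A possible alternative (or addition) to the rndc reload resource is to have named rescan its interfaces on its own via interface-interval in named.conf.options. The value is in minutes, so failover is only as quick as the rescan interval; the snippet below is just a sketch of that idea:

options {
    // rescan network interfaces every minute so named picks up the takeover IP
    interface-interval 1;
};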
Note: The IP address above is the one that will be used for the takeover; in other words, that’s where the DNS queries go.
Now you should be able to run DNS queries against the IP configured above.
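A quick way to test is to query the takeover address directly; swap in a zone your servers actually answer for (example.com here is just a stand-in):

dig @192.168.1.240 example.com +short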
You can force the IP address over to the secondary by running:
sudo crm resource migrate dns_ip
and then bring it back with:
sudo crm resource unmigrate dns_ip
If you run ifconfig on the nodes as you do this, you will see a virtual interface show up on the active node, listening on the virtual IP address.
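On the active node that looks roughly like the following; the interface name and MAC address will of course differ on your hardware:

eth2:0    Link encap:Ethernet  HWaddr 00:11:22:33:44:55
          inet addr:192.168.1.240  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1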
There are other things you can do as well, like stopping the resource completely, putting a node in standby, etc.
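A few crm commands that cover those cases, using the resource and node names from my setup above:

sudo crm resource stop VIP_rndc      # stop the whole group (IP + rndc reload)
sudo crm resource start VIP_rndc
sudo crm node standby glades         # take a node out of rotation
sudo crm node online glades          # and bring it back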
I’m pretty pleased at this point with the setup.
@matthias
Another option to solve the bind9 bind issue is to include the bind9 service in the cluster configuration and add resource constraints to start it after the IP failover has completed.
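For anyone wanting to go that route, a rough crm configure sketch of that suggestion might look something like the following; the resource and constraint names are made up, and lsb:bind9 assumes the stock /etc/init.d/bind9 script:

crm(live)configure# primitive bind9_svc lsb:bind9 op monitor interval="30s"
crm(live)configure# colocation bind9_with_vip inf: bind9_svc dns_ip
crm(live)configure# order bind9_after_vip inf: dns_ip bind9_svc
crm(live)configure# commit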
I love this post. In fact I have the very same setup in my ISP. We are running Ubuntu 14.04. I am trying to get it working on 20.04 but it doesn’t seem to work. Do you have any advice?
Hi Craig, glad this was useful to you. To be honest, after I upgraded the servers a couple of times I found things became less stable than I liked, and I ended up tearing it down since it became more troublesome than it was worth. This was all in my home network, so perhaps my budget gear was to blame. If I were to rework this, I would likely implement something using Kubernetes, which has some great features around resilience and coordination. I’ve had great luck with Kubernetes in my professional life, and some new automation bits make it pretty reasonable to deploy and run. I’m guessing you’re running your own hardware; if you want to try Kubernetes, you can easily test things with something like Minikube or MicroK8s, and for deploying on your own servers something like kubespray has worked well for me (I assume as an ISP you’re not leveraging public cloud resources, for which I can speak highly of Kubernetes Operations, kops). Hope this helps. (I do offer consulting services, in case you’re looking for a hand.)
Thank you for your quick response.
I have managed to find out that the latest Ubuntu 20.04 release uses Corosync by default, and it’s packaged together when you run apt-get install pacemaker.
So instead of using heartbeat with pacemaker I am now using Corosync with pacemaker. I have written a quick guide if you want it.
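From what I can tell, the equivalent starting point on 20.04 would be something along these lines; the package names are my assumption and may differ:

sudo apt-get install pacemaker corosync crmsh
sudo crm status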
Sorry for falling silent, Craig. Yes, I would love to see the guide. I might rework some things to bring the redundant DNS back. 🙂
Hi Matt,
Please see link below. I referenced your blog also.
https://blog2.craigduff.co.uk/bind9-with-cluster-crm-pacemaker-corosync/
Thank you, Craig!