Adventures in DNS Resolution
So let's begin with the ending: if you have one or more internal DNS servers which you wish
to use with pfSense’s DNS Resolver and Domain Overrides function, you must
include an interface capable of communicating with those servers in the
Outgoing Network Interfaces configuration item, even if those servers are
accessible on another LAN interface. If you don’t, you will be unable to resolve
any internal names and you’ll see the RTO of your internal servers slowly creep
up until it caps out at 120000 on the resolver status page. Now, on with the
story.
Why are you doing this?
Mostly to make my home network less frustrating for my wife but also a little bit
for ad blocking. I recently decided to enable pfBlockerNG to do full house ad
removal which requires moving from the DNS Forwarder to the more fully featured
DNS Resolver on pfSense. I also have a cluster of FreeIPA servers which handle
name resolution for all the various machines/services running on my LAN. Unfortunately,
because I wasn’t very smart when I set it all up, these services are running under
a subdomain of another domain I use out on the wider net. So SOA stuff is….screwy.
For those local services I have to make sure my FreeIPA servers get asked DNS
questions first, otherwise I’ll get an NXDOMAIN from the “real” nameservers
out on the net.
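To make the "ask FreeIPA first" idea concrete, here is a minimal sketch of how a domain override conceptually routes a query: the longest matching domain suffix wins, and everything else falls through to the public forwarders. The function name, the `lan.example.com` subdomain and the `10.0.20.x` addresses are all made up for illustration. This is not how Unbound implements it internally.

```python
def pick_upstreams(qname, overrides, default):
    """Return the servers for the longest override suffix matching qname."""
    labels = qname.rstrip(".").lower().split(".")
    # Try the full name first, then drop leading labels one at a time,
    # so the most specific override is found before any broader one.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in overrides:
            return overrides[suffix]
    return default


# Hypothetical values: an internal subdomain served by two FreeIPA boxes,
# with public forwarders for everything else.
OVERRIDES = {"lan.example.com": ["10.0.20.11", "10.0.20.12"]}
DEFAULT = ["1.1.1.1", "1.0.0.1"]
```

With those values, `pick_upstreams("weather.lan.example.com", OVERRIDES, DEFAULT)` returns the two internal servers, while `pick_upstreams("www.example.com", OVERRIDES, DEFAULT)` falls through to the forwarders.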
Previously, the pfSense router was configured with the DNS Forwarder to look at
each of the FreeIPA servers in order and then at 1.1.1.1 as a last resort when
resolving DNS names. This “worked” but had a whole host of downsides. If any of
the FreeIPA servers went down, every request took ages because I had to configure
pfSense to query sequentially to keep my FreeIPA boxes “first” in the lookup list.
This was an improvement over the previous system, where everything always went
through the FreeIPA servers, meaning that if they went down DNS resolution died
for the whole house. Not great.
As a final hiccup, there are certain DNS names that have to be resolvable even if the FreeIPA servers are offline, because they are integral to bringing the cluster which hosts FreeIPA online. So those have to be handled somehow, and were previously static host entries on the pfSense DNS Forwarder.
What other questionable choices did you make?
Well, my home network isn’t a simple WAN/LAN split either. There’s a wired LAN segment, a segment for my VMs and VM hosts, and a segment for WiFi devices. This is actually pretty neat for stuff like security and keeping rude IoT devices contained, but it can cause problems when you need to reach out to services across a segment boundary.
This had manifested previously with my poor choice of putting the VM NFS host on the LAN segment and the VM host which used it for VM disk storage on the VM segment. This led to really fun events like the router going down causing every VM in my cluster to tip over until force rebooted with possibly corrupted disks. Thankfully, this mistake was recently remedied.
All of this to say that the FreeIPA servers live on the VM segment of my home network and are only accessible from inside my home network.
When did you discover something was wrong?
Since it is approximately the temperature of the sun at home right now, hurray for
dry AZ heat, I wanted to make sure all my weather station stuff was working right.
I went to check on the gateway unit and found that I couldn’t get the name to
resolve. Begin panic. Check the VM cluster, everything seems to be running. Check
the VM consoles, everything is good and services are running as expected. Log into
the FreeIPA console and look up the unit’s static IP and try that, connects just
fine by IP. Try dig @freeipa weather.example.com and get back the right response.
Try dig @router weather.example.com and…..nothing, a SERVFAIL. So we know
where the problem is, now to discover what the problem is.
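If you'd rather script that comparison than run dig twice, here is a rough standard-library sketch that sends the same A-record query to one server and reports the response code. The function names are my own invention, and it only handles the simple UDP case, so treat it as a starting point and not a dig replacement.

```python
import socket
import struct

# A few of the standard DNS response codes from the header's low 4 flag bits.
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, qid=0x1234):
    """Build a minimal DNS query for an A record with recursion desired."""
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def rcode_of(response):
    """Extract the response code name from a raw DNS response."""
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, f"RCODE{code}")


def ask(server, name, timeout=3.0):
    """Send one UDP query to server:53 and return the rcode name or 'TIMEOUT'."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            data, _ = sock.recvfrom(512)
        except socket.timeout:
            return "TIMEOUT"
    return rcode_of(data)
```

Calling `ask()` once with the FreeIPA server's IP and once with the router's IP (both need to be IPv4 addresses, not hostnames, since resolving names is exactly what's broken) should show the same split I saw: an answer from one and a SERVFAIL from the other.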
How did you fix it?
First order of business was to set the local zone type in pfSense to Inform and
the log level to 3 to try and get a better idea of what the heck was going
on. I also pulled up the DNS Resolver status page and took a look at the server
stats that it was reporting. One thing I noticed right away was that when Unbound was
restarted and the first query to an internal domain was made, the RTO number for
each of the internal DNS servers started low and steadily climbed until it capped
out at 120000. Now, I can’t seem to find anywhere what the RTO number actually
is, but I’m going to guess it’s something like Remote Time Out. I’ll also note that the
forwarding servers I have configured, 1.1.1.1 and 1.0.0.1, both have RTO
values in the range of 1-600.
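Spotting the capped servers by eye works, but the same check is easy to script. The sketch below assumes you have a text dump of Unbound's per-server stats (for example from `unbound-control dump_infra`) where each line starts with the server address and zone and contains an `rto` keyword followed by a number. That line layout is my assumption, so check it against your own output before trusting this. The function name is made up too.

```python
RTO_CAP = 120000  # the ceiling I saw on the resolver status page


def stalled_servers(dump_text, cap=RTO_CAP):
    """Return (address, zone, rto) for every line whose rto reached the cap."""
    stalled = []
    for line in dump_text.splitlines():
        fields = line.split()
        # Expect: <address> <zone> ... rto <number> ...
        # Skip anything that lacks an "rto" keyword followed by a value.
        if "rto" not in fields[2:-1]:
            continue
        value = fields[fields.index("rto", 2) + 1]
        if value.isdigit() and int(value) >= cap:
            stalled.append((fields[0], fields[1], int(value)))
    return stalled
```

Anything this returns is a server Unbound has effectively given up on, which in my case would have been every FreeIPA box and neither of the public forwarders.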
Next I checked the logs and watched the requests get parsed and sent out. I could
see regular requests to the rest of the net exiting my network cleanly and getting
responses back but anything for my internal domain seemed to just hang and go
nowhere. So maybe I managed to firewall myself off from the FreeIPA boxes? Unlikely,
but since the ad blocker injects a bunch of firewall rules it’s possible. I tried
the dig @freeipa weather.example.com again, but from the router’s console this
time. Instant response with the correct IP. Definitely not a firewall issue then.
At a bit of a loss I went back to read through the DNS Resolver configuration
options in more detail and stumbled upon the key to the whole thing. I had, like
a giant idiot, overridden the default value for Outgoing Network Interfaces from
all available interfaces to only my two WAN uplinks. Since my FreeIPA servers
aren’t visible from the internet and aren’t on my WAN segment….surprise, pfSense
couldn’t communicate with them! Adding my VM segment to the list of allowed
interfaces and restarting the service got everything patched up instantly.
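In hindsight this is a check I could have automated. Here is a rough sketch, with a hypothetical function name and made-up subnets, that flags any override server which doesn't sit on the subnet of at least one selected outgoing interface. It's a deliberate simplification: real reachability also depends on routing and firewall rules, so an empty result doesn't prove everything is fine.

```python
import ipaddress


def uncovered_servers(override_servers, selected_interfaces):
    """List override servers not on the subnet of any selected interface.

    selected_interfaces maps an interface name to its subnet in CIDR form.
    """
    networks = [ipaddress.ip_network(cidr) for cidr in selected_interfaces.values()]
    missing = []
    for server in override_servers:
        addr = ipaddress.ip_address(server)
        if not any(addr in net for net in networks):
            missing.append(server)
    return missing


# Hypothetical layout: two WAN uplinks selected, FreeIPA on a VM segment.
SELECTED = {"WAN1": "203.0.113.0/24", "WAN2": "198.51.100.0/24"}
FREEIPA = ["10.0.20.11", "10.0.20.12"]
```

With only the two WANs selected, `uncovered_servers(FREEIPA, SELECTED)` returns both FreeIPA addresses. Add something like `"VM": "10.0.20.0/24"` to the mapping and the list comes back empty.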
What did you learn?
Read The Furnished Documentation. Beyond that, DNS is hard, and tracking down network issues takes an iterative approach in both discovery and repair. Don’t be like me: go slow and pay attention to default values. They might just be there for a reason!