Resolver returns SERVFAIL until restarted
I am using knot-resolver 5.4.4-cznic.1 on Debian 10. After some (rather long) time, the resolver starts to return SERVFAIL for some records (those secured by DNSSEC).
From what I was able to find, I believe I stumbled upon a bug which might be related to the following issues:
It can be remediated quickly just by restarting the kresd service, which makes me wonder whether this is an issue in the resolver itself or rather in the Debian packaging (missing restart hooks?).
From the log (full log attached) I can see:
- There are several failed attempts to refresh trust anchors
[taupd ] active refresh failed for . with rcode: 2
- After a few days (when the cache expires?) the problem starts to manifest itself and the resolver starts to respond with SERVFAIL:
[plan ][00000.00] plan 'haproxy.luffy.cx.' type 'A' uid [17896.00]
[iterat][17896.00] 'haproxy.luffy.cx.' type 'A' new uid was assigned .01, parent uid .00
[cache ][17896.01] => skipping exact RR: rank 060 (min. 030), new TTL -155800
[cache ][17896.01] => skipping unfit NS RR: rank 002, new TTL -76600
[cache ][17896.01] => skipping unfit NS RR: rank 002, new TTL -81800
[cache ][17896.01] => trying zone: ., NSEC, hash 0
[cache ][17896.01] => NSEC sname: range search miss (!covers)
[cache ][17896.01] => skipping zone: ., NSEC, hash 0;new TTL -123456789, ret -2
[zoncut][17896.01] found cut: . (rank 060 return codes: DS -2, DNSKEY -116)
[resolv][17896.01] >< TA: '.'
[plan ][17896.01] plan '.' type 'DNSKEY' uid [17896.02]
[iterat][17896.02] '.' type 'DNSKEY' new uid was assigned .03, parent uid .01
[cache ][17896.03] => skipping exact RR: rank 060 (min. 030), new TTL -5783
[cache ][17896.03] => trying zone: ., NSEC, hash 0
[cache ][17896.03] => NSEC sname: match but failed type check
[cache ][17896.03] => skipping zone: ., NSEC, hash 0;new TTL -123456789, ret -2
[select][00000.00] NO6: is KO [exploit]
[select][17896.03] => id: '28780' choosing: 'i.root-servers.net.'@'2001:7fe::53#00053' with timeout 10000 ms zone cut: '.'
[resolv][17896.03] => id: '28780' querying: 'i.root-servers.net.'@'2001:7fe::53#00053' zone cut: '.' qname: '.' qtype: 'DNSKEY' proto: 'tcp'
[worker][17896.03] => connecting to: '2001:7fe::53#00053'
[select][17896.03] NO6: timed out, but bad already
[select][17896.03] => id: '28780' noting selection error: 'i.root-servers.net.'@'2001:7fe::53#00053' zone cut: '.' error: 3 TCP_CONNECT_FAILED
[iterat][17896.03] '.' type 'DNSKEY' new uid was assigned .04, parent uid .01
[select][00000.00] NO6: is KO [exploit]
[select][17896.04] => id: '17180' choosing: 'm.root-servers.net.'@'2001:dc3::35#00053' with timeout 10000 ms zone cut: '.'
[resolv][17896.04] => id: '17180' querying: 'm.root-servers.net.'@'2001:dc3::35#00053' zone cut: '.' qname: '.' qtype: 'DNSKEY' proto: 'udp'
- After restarting the service via
systemctl restart kresd@1
the problem instantly disappears
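To see when the symptoms first show up, the attached log can be scanned for the three kinds of messages quoted above. This is only a rough sketch: the regular expressions are derived from the excerpt, the category names are made up for illustration, and treating -123456789 as a placeholder rather than a real TTL is my assumption.

```python
import re
from collections import Counter

# Patterns derived from the log excerpt above; category names are hypothetical.
PATTERNS = {
    "ta_refresh_failed": re.compile(
        r"\[taupd\s*\] active refresh failed for \S+ with rcode: \d+"),
    "negative_ttl": re.compile(r"\[cache\s*\].*new TTL (-\d+)"),
    "tcp_connect_failed": re.compile(
        r"\[select\s*\].*error: \d+ TCP_CONNECT_FAILED"),
}

# Assumption: this value looks like a placeholder in the "skipping zone" lines,
# so it is not counted as a genuinely expired record.
PLACEHOLDER_TTL = "-123456789"


def summarize(lines):
    """Count how many lines match each symptom pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if not match:
                continue
            if name == "negative_ttl" and match.group(1) == PLACEHOLDER_TTL:
                continue
            counts[name] += 1
    return counts


# Example use against the attached file:
#   with open("kresd.log", encoding="utf-8", errors="replace") as fh:
#       print(summarize(fh))
```

Running it over slices of the log (for example per day) should show whether the trust anchor refresh failures start well before the first SERVFAIL.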
It seems to me like the resolver lost all root servers and needs a restart to reload them. It may also be relevant that the machine running the resolver has no IPv6 connectivity.
I am not sure how to reproduce this without waiting for days or weeks. This time, the issue appeared after 23 days.
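For anyone who wants to confirm the missing IPv6 connectivity from the resolver host, a plain TCP connect attempt to the root server addresses from the log is enough. This is a minimal sketch, not part of kresd; the function name `tcp_reachable` is made up for illustration.

```python
import socket


def tcp_reachable(host, port=53, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and "network unreachable".
        return False


# Addresses taken from the log excerpt above:
#   for addr in ("2001:7fe::53", "2001:dc3::35"):
#       print(addr, tcp_reachable(addr))
```

On a host without IPv6 both calls should return False, which would match the TCP_CONNECT_FAILED errors in the log.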
Full log: kresd.log