Resolver returns SERVFAIL until restarted

You should set net.ipv6 = false. That will slightly improve latency and some other behavior, and it will most likely also work around this issue.

Thanks, good hint, I will definitely do that and we will see if that makes a difference. I was actually planning to finally configure IPv6 connectivity, but until then I will disable it explicitly for the resolver.

Hello. We use knot-resolver 5.4.4 in production and we ran into the same situation quite recently. Can we expect a proper fix soon, or will we need to live with that "workaround" for some time?

If your resolver does not have working IPv6, you want that setting regardless of this issue. The other improvements won't change anytime soon.

Can confirm this issue persists even when net.ipv6 = false is set. Running version 5.5.0.

Can you get debug logs from the failing moment when it's configured that way?

You can enable the logs dynamically, e.g. with

echo "log_level('debug')" | socat - UNIX-CONNECT:/run/knot-resolver/control/1

(for the usual kresd@1.service)

Hello, we are using Knot Resolver version 5.5.0 and we just hit this issue again for a single domain, allers.nl. The domain was not resolvable through Knot Resolver, but it was resolvable through Google DNS (8.8.8.8). Resolution always ended with SERVFAIL until Knot Resolver was manually restarted. I have collected the trace logs.
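A comparison like the one described above can be scripted so that only the response status is printed. This is a minimal sketch, assuming dig (or kdig) is installed; the helper name rcode_of is illustrative and not part of any tool.

```shell
# Print the response status (e.g. NOERROR, SERVFAIL) from dig/kdig output on stdin.
# Works with both "status: X," (dig) and "status: X;" (kdig, kresd logs).
rcode_of() {
    grep -oE 'status: [A-Z]+' | head -n 1 | cut -d ' ' -f 2
}
```

For example, dig @127.0.0.1 allers.nl A | rcode_of queries the local resolver (127.0.0.1 is an assumed listen address), and dig @8.8.8.8 allers.nl A | rcode_of queries Google DNS for comparison.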

"[reqdbg][policy][25095.00] following rrsets were marked as interesting:"
"[reqdbg][policy][25095.00] answer packet:
;; ->>HEADER<<- opcode: QUERY; status: SERVFAIL; id: 25095
;; Flags: qr rd ra  QUERY: 1; ANSWER: 0; AUTHORITY: 0; ADDITIONAL: 1
;; EDNS PSEUDOSECTION:
;; Version: 0; flags: ; UDP size: 1232 B; ext-rcode: Unused
;; QUESTION SECTION
allers.nl.      A
;; ADDITIONAL SECTION"
"[reqdbg][iterat][25095.00]   'allers.nl.' type 'A' new uid was assigned .01, parent uid .00"
"[reqdbg][cache ][25095.01]   => trying zone: allers.nl., NSEC, hash 0"
"[reqdbg][cache ][25095.01]   => NSEC sname: range search found stale or insecure entry"
"[reqdbg][cache ][25095.01]   => skipping zone: allers.nl., NSEC, hash 0;new TTL -123456789, ret -2"
"[reqdbg][select][25095.01]   => id: '30610' choosing to resolve A: 'ns13.kpn.net.' zone cut: 'allers.nl.'"
"[reqdbg][zoncut][25095.01]   found cut: allers.nl. (rank 002 return codes: DS 0, DNSKEY 0)"
"[reqdbg][plan  ][25095.01]   plan 'ns13.kpn.net.' type 'A' uid [25095.02]"
"[reqdbg][iterat][25095.02]     'ns13.kpn.net.' type 'A' new uid was assigned .03, parent uid .01"
"[reqdbg][cache ][25095.03]     => skipping exact RR: rank 060 (min. 000), new TTL -67440"
"[reqdbg][cache ][25095.03]     => trying zone: kpn.net., NSEC3, hash 6156003e"
"[reqdbg][cache ][25095.03]     => NSEC3 depth 1: hash 9740ns6llptro0du29ftcv5q28kmsmro"
"[reqdbg][cache ][25095.03]     => NSEC3 encloser error for ns13.kpn.net.: range search found stale or insecure entry"
"[reqdbg][cache ][25095.03]     => NSEC3 encloser error for kpn.net.: range search found stale or insecure entry"
"[reqdbg][cache ][25095.03]     => NSEC3 depth 0: hash gubsebl5t4hm4p6un2l053okuegaumq0"
"[reqdbg][cache ][25095.03]     => skipping zone: kpn.net., NSEC, hash 0;new TTL -123456789, ret -2"
"[reqdbg][zoncut][25095.03]     found cut: kpn.net. (rank 002 return codes: DS 0, DNSKEY -116)"
"[reqdbg][plan  ][25095.03]     plan 'kpn.net.' type 'DNSKEY' uid [25095.04]"
"[reqdbg][resolv][25095.03]     finished in state: 8, queries: 1, mempool: 16400 B"
"[reqdbg][iterat][25095.04]       'kpn.net.' type 'DNSKEY' new uid was assigned .05, parent uid .03"
"[reqdbg][cache ][25095.05]       => skipping exact RR: rank 060 (min. 030), new TTL -69959"
"[reqdbg][cache ][25095.05]       => trying zone: kpn.net., NSEC3, hash 6156003e"
"[reqdbg][cache ][25095.05]       => NSEC3 depth 0: hash gubsebl5t4hm4p6un2l053okuegaumq0"
"[reqdbg][cache ][25095.05]       => NSEC3 encloser error for kpn.net.: range search found stale or insecure entry"
"[reqdbg][cache ][25095.05]       => skipping zone: kpn.net., NSEC, hash 0;new TTL -123456789, ret -2"
"[reqdbg][select][25095.05]       => id: '39189' no suitable transport, zone cut: 'kpn.net.'"
"[reqdbg][iterat][25095.05]       'kpn.net.' type 'DNSKEY' new uid was assigned .06, parent uid .03"
"[reqdbg][select][25095.06]       => id: '57918' no suitable transport, zone cut: 'kpn.net.'"
"[reqdbg][resolv][25095.06]       AD: request NOT classified as SECURE"

I suspect this might be due to them switching nameservers without regard to the TTL. The parent-side NS record has a TTL of 2 days, but the *.kpn.net servers already refuse to answer for this name. Unfortunately, this particular log doesn't show exactly why no transport is considered available.
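For a longer capture, a small filter can at least show which zone cuts ended with "no suitable transport". This is a rough sketch, assuming the debug output was saved to a file; the function name and the file are illustrative.

```shell
# List the distinct zone cuts for which server selection found no usable transport.
# $1: path to a saved debug log (hypothetical file, e.g. exported from the journal).
cuts_without_transport() {
    grep -F 'no suitable transport' "$1" \
        | sed -E "s/.*zone cut: '([^']+)'.*/\1/" \
        | sort -u
}
```

Run against the excerpt above, this would list only kpn.net., which narrows the problem to the nameserver's own zone rather than allers.nl itself.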

If my guess is right, it would help to decrease the TTL bound on the resolver side (though not retroactively). Currently the default is very generous at 6 days, though I think bigger operators often favor values like a day or even an hour.
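One way to lower that bound is cache.max_ttl(), sent over the same control socket as the log_level command earlier in this thread. This is a sketch only: the helper names are made up for illustration, and the socket path is the one assumed for kresd@1.service.

```shell
# Build the Lua command that caps cached TTLs at the given number of seconds.
max_ttl_cmd() {
    printf 'cache.max_ttl(%d)\n' "$1"
}

# Send the command to one running kresd instance through its control socket.
# $1: TTL cap in seconds, $2: control socket path (assumed layout).
set_max_ttl() {
    max_ttl_cmd "$1" | socat - "UNIX-CONNECT:$2"
}
```

For example, set_max_ttl 86400 /run/knot-resolver/control/1 would cap TTLs at one day for that instance. A runtime change like this presumably does not survive a restart, so the same cache.max_ttl(86400) line would also need to go into the resolver's configuration file.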

Maybe that was a coincidence.

A similar-looking log was reported here, and it seems that no NS changes happened in that case: https://lists.nic.cz/hyperkitty/list/knot-resolver-users@lists.nic.cz/message/4ZE325MFMHNA2B3FVZ3RSSH7X66ZUILP/

We'll add a bit more logging that also covers this situation: !1298 (merged)

Thanks for the answer, we will consider decreasing the cache TTL on our side.
