Resolver retransmits too early
Steps to reproduce on an isolated network:
- Set up an authoritative server on 1.0.0.100, with simulated static (100 ms) latency to the resolver, serving this zone:
. 86400 IN SOA j.root-servers.net. nstld.verisign-grs.com. 2019072500 1800 900 604800 86400
*. 3600 IN A 1.1.1.1
. 86400 IN NS j.root-servers.net.
j.root-servers.net. 86400 IN A 1.0.0.100
- Start the resolver on 1.0.0.1 with these root hints, capturing the traffic to a PCAP file:
. 3600000 NS J.ROOT-SERVERS.NET.
J.ROOT-SERVERS.NET. 3600000 A 1.0.0.100
- Query the resolver with questions in the root zone, triggering cache misses:
  for i in $(seq 0 9999); do echo "$i A"; done | dnsperf -q 100 -a 2.0.0.1 -s 1.0.0.1
- Observe the PCAP for "Destination unreachable (Port unreachable)" packets, which are the result of the early retransmits (as seen in the Wireshark screenshot below).
These retransmits happen 110 to 150 ms after the original transmission. The effect is more pronounced when the number of outstanding queries is higher (the -q argument to dnsperf). About 1 % of queries in this test result in a retransmit.
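The effect can be reproduced in miniature without any DNS machinery. The sketch below is not resolver code; it uses a hypothetical 150 ms server-side delay and a 110 ms client timeout (chosen to make the race deterministic) to show that a retransmit timer shorter than the server's response time makes the server process every query twice:

```python
# Sketch: a UDP client whose retransmit timeout (110 ms) is shorter than
# the server's response delay (150 ms, an assumption for determinism)
# sends duplicate queries, just like the early retransmits in the PCAP.
import socket
import threading
import time

received = []

srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
srv.bind(("127.0.0.1", 0))          # ephemeral port to avoid clashes
srv.settimeout(1.0)
addr = srv.getsockname()

def serve():
    try:
        while True:
            data, peer = srv.recvfrom(512)
            received.append(data)
            time.sleep(0.150)       # simulated server-side delay (assumption)
            srv.sendto(b"answer", peer)
    except socket.timeout:
        pass                        # no more queries; exit the thread

threading.Thread(target=serve, daemon=True).start()

cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cli.settimeout(0.110)               # retransmit timer shorter than the RTT
cli.sendto(b"query", addr)
try:
    cli.recvfrom(512)
except socket.timeout:
    cli.sendto(b"query", addr)      # the early retransmit
    cli.recvfrom(512)               # the first answer finally arrives

time.sleep(0.4)                     # let the server drain its queue
print(len(received))                # the server saw the query twice
```

With the delays above the client always times out once, so the server receives the same query twice, mirroring the roughly 1 % duplicate rate seen under load.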
The verbose log of the query from the screenshot also shows the two transmissions (note the duplicated "querying" line):
[00000.00][plan] plan '7251.' type 'A' uid [07251.00]
[07251.00][iter] '7251.' type 'A' new uid was assigned .01, parent uid .00
[07251.01][cach] => no NSEC* cached for zone: .
[07251.01][cach] => skipping zone: ., NSEC, hash 0;new TTL -123456789, ret -2
[07251.01][resl] => going insecure because there's no covering TA
[07251.01][zcut] found cut: . (rank 020 return codes: DS -2, DNSKEY -2)
[07251.01][resl] => id: '24592' querying: '1.0.0.100#00053' score: 100 zone cut: '.' qname: '7251.' qtype: 'A' proto: 'udp'
[07251.01][resl] => id: '24592' querying: '1.0.0.100#00053' score: 100 zone cut: '.' qname: '7251.' qtype: 'A' proto: 'udp'
[07251.01][iter] <= rcode: NOERROR
[07251.01][cach] => stashed 7251. A, rank 020, 20 B total, incl. 0 RRSIGs
[07251.01][resl] <= server: '1.0.0.100' rtt: 148 ms
[07251.01][resl] AD: request NOT classified as SECURE
[07251.01][resl] finished: 4, queries: 1, mempool: 16400 B
Now that I think of it, this is probably caused by the packet being stuck in a buffer somewhere, but the resolver should cope with this in better ways than generating more traffic: a 110 ms retransmit timeout is too low for a server with 100 ms latency. This is therefore related to #447 (closed).
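For comparison, TCP's standard retransmission-timeout estimator (RFC 6298) computes RTO = SRTT + max(G, 4·RTTVAR) and clamps it to a 1 s lower bound, so a steady 100 ms path would never be retransmitted after only 110 ms. A sketch of that estimator (illustrative only; not a claim about the resolver's actual timer logic):

```python
# Sketch of the RFC 6298 RTO estimator, fed with steady 100 ms RTT
# samples, to illustrate why a ~110 ms retransmit is too aggressive
# for a server with 100 ms latency.
ALPHA, BETA, K = 1 / 8, 1 / 4, 4   # smoothing constants from RFC 6298
MIN_RTO = 1.0                      # RFC 6298 lower bound, in seconds
G = 0.001                          # assumed clock granularity, 1 ms

def rto_estimates(samples):
    srtt = rttvar = None
    for r in samples:
        if srtt is None:
            srtt, rttvar = r, r / 2            # first measurement
        else:
            rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - r)
            srtt = (1 - ALPHA) * srtt + ALPHA * r
        yield max(MIN_RTO, srtt + max(G, K * rttvar))

# Twenty consecutive 100 ms round trips: RTTVAR decays toward zero and
# SRTT converges to 100 ms, so the 1 s floor dominates throughout.
rtos = list(rto_estimates([0.100] * 20))
print(rtos[-1])
```

Under these inputs every estimate sits at the 1 s floor, an order of magnitude above the 110 ms retransmits observed in the PCAP.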