Resolver retransmits too early
Steps to reproduce on an isolated network:
- Set up an authoritative server on 1.0.0.100, with simulated static (100 ms) latency to the resolver, serving this zone:
. 86400 IN SOA j.root-servers.net. nstld.verisign-grs.com. 2019072500 1800 900 604800 86400
*. 3600 IN A 1.1.1.1
. 86400 IN NS j.root-servers.net.
j.root-servers.net. 86400 IN A 1.0.0.100
- Start the resolver on 1.0.0.1 with these root hints, capturing the traffic to a PCAP file:
. 3600000 NS J.ROOT-SERVERS.NET.
J.ROOT-SERVERS.NET. 3600000 A 1.0.0.100
- Query the resolver with questions in the root zone, triggering cache misses:
  for i in $(seq 0 9999); do echo "$i A"; done | dnsperf -q 100 -a 2.0.0.1 -s 1.0.0.1
- Observe the PCAP for "Destination unreachable (Port unreachable)" packets, which are the result of the early retransmits (as seen in the Wireshark screenshot below).
These retransmits happen 110 to 150 ms after the original transmission. The effect is more pronounced when the number of outstanding queries is higher (the -q argument to dnsperf). About 1 % of queries in this test result in a retransmit.
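The effect can be reproduced in miniature without any DNS machinery. The sketch below is not resolver code; it uses a hypothetical 150 ms server-side delay and a 110 ms client timeout (chosen to make the race deterministic) to show that a retransmit timer shorter than the server's response time makes the server process every query twice:

```python
# Sketch: a UDP client whose retransmit timeout (110 ms) is shorter than
# the server's response delay (150 ms, an assumption for determinism)
# sends duplicate queries, just like the early retransmits in the PCAP.
import socket
import threading
import time

received = []

srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
srv.bind(("127.0.0.1", 0))          # ephemeral port to avoid clashes
srv.settimeout(1.0)
addr = srv.getsockname()

def serve():
    try:
        while True:
            data, peer = srv.recvfrom(512)
            received.append(data)
            time.sleep(0.150)       # simulated server-side delay (assumption)
            srv.sendto(b"answer", peer)
    except socket.timeout:
        pass                        # no more queries; exit the thread

threading.Thread(target=serve, daemon=True).start()

cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cli.settimeout(0.110)               # retransmit timer shorter than the RTT
cli.sendto(b"query", addr)
try:
    cli.recvfrom(512)
except socket.timeout:
    cli.sendto(b"query", addr)      # the early retransmit
    cli.recvfrom(512)               # the first answer finally arrives

time.sleep(0.4)                     # let the server drain its queue
print(len(received))                # the server saw the query twice
```

With the delays above the client always times out once, so the server receives the same query twice, mirroring the roughly 1 % duplicate rate seen under load.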
The verbose log of the query from the screenshot also shows the two transmissions (note the duplicated "querying" line):
[00000.00][plan] plan '7251.' type 'A' uid [07251.00]
[07251.00][iter] '7251.' type 'A' new uid was assigned .01, parent uid .00
[07251.01][cach] => no NSEC* cached for zone: .
[07251.01][cach] => skipping zone: ., NSEC, hash 0;new TTL -123456789, ret -2
[07251.01][resl] => going insecure because there's no covering TA
[07251.01][zcut] found cut: . (rank 020 return codes: DS -2, DNSKEY -2)
[07251.01][resl] => id: '24592' querying: '1.0.0.100#00053' score: 100 zone cut: '.' qname: '7251.' qtype: 'A' proto: 'udp'
[07251.01][resl] => id: '24592' querying: '1.0.0.100#00053' score: 100 zone cut: '.' qname: '7251.' qtype: 'A' proto: 'udp'
[07251.01][iter] <= rcode: NOERROR
[07251.01][cach] => stashed 7251. A, rank 020, 20 B total, incl. 0 RRSIGs
[07251.01][resl] <= server: '1.0.0.100' rtt: 148 ms
[07251.01][resl] AD: request NOT classified as SECURE
[07251.01][resl] finished: 4, queries: 1, mempool: 16400 B
Now that I think of it, this is probably caused by the packet being stuck in a buffer somewhere, but the resolver should cope with this in better ways than generating more traffic: a 110 ms retransmit timeout is too low for a server with 100 ms latency. This is therefore related to #447 (closed).
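For comparison, TCP's standard retransmission-timeout estimator (RFC 6298) computes RTO = SRTT + max(G, 4·RTTVAR) and clamps it to a 1 s lower bound, so a steady 100 ms path would never be retransmitted after only 110 ms. A sketch of that estimator (illustrative only; not a claim about the resolver's actual timer logic):

```python
# Sketch of the RFC 6298 RTO estimator, fed with steady 100 ms RTT
# samples, to illustrate why a ~110 ms retransmit is too aggressive
# for a server with 100 ms latency.
ALPHA, BETA, K = 1 / 8, 1 / 4, 4   # smoothing constants from RFC 6298
MIN_RTO = 1.0                      # RFC 6298 lower bound, in seconds
G = 0.001                          # assumed clock granularity, 1 ms

def rto_estimates(samples):
    srtt = rttvar = None
    for r in samples:
        if srtt is None:
            srtt, rttvar = r, r / 2            # first measurement
        else:
            rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - r)
            srtt = (1 - ALPHA) * srtt + ALPHA * r
        yield max(MIN_RTO, srtt + max(G, K * rttvar))

# Twenty consecutive 100 ms round trips: RTTVAR decays toward zero and
# SRTT converges to 100 ms, so the 1 s floor dominates throughout.
rtos = list(rto_estimates([0.100] * 20))
print(rtos[-1])
```

Under these inputs every estimate sits at the 1 s floor, an order of magnitude above the 110 ms retransmits observed in the PCAP.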