early detection of dropped answers over TCP connections
Problem
Currently, individual DNS queries sent over a TCP connection do not have a per-query timer, and we leave it to the TCP stack to handle packet loss. This works fine for network-level problems but does not work for queries dropped at the application level.
Issue seen in the field: #551. That is, queries are dropped on the server side and clients get SERVFAIL once the whole TCP connection times out.
Another instance of this problem is Unbound's default limit on the number of queries resolved in parallel over a single TCP connection: before commit https://github.com/NLnetLabs/unbound/commit/f81d0ac0474cc8904e1240a512b935c8e466f81b, Unbound would process only 32 queries in parallel and keep the other queries on the same TCP connection hanging, potentially leading to long periods without responses.
Vague proposal
- Use a per-query timeout also for queries over TCP/TLS/HTTPS, and evaluate whether a query that times out should be resent using another transport.
- Detect "suspicious" TCP connection states when deduplicating connections and skip over "suspicious" connections. For example, do not reuse a connection if it has had queries hanging on it for longer than 3 seconds. TODO: Is there some other TCP-level tuning we can do?
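
To make the two points above more concrete, here is a rough sketch in Python of the bookkeeping they would need. It is not resolver code: `TcpConnection`, `pick_connection`, `timed_out_queries` and `SUSPICIOUS_AFTER_S` are hypothetical names made up for illustration, and the 3-second threshold is just the example value from the proposal. The idea is to record a send time per query, so that both a per-query timeout and a "suspicious connection" check can be derived from the same data.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional

# Hypothetical threshold, taken from the "3 seconds" example in the proposal.
SUSPICIOUS_AFTER_S = 3.0


@dataclass
class TcpConnection:
    """Illustrative stand-in for one upstream TCP/TLS connection."""
    peer: str
    # DNS message ID -> monotonic time at which the query was sent.
    pending: Dict[int, float] = field(default_factory=dict)

    def query_sent(self, msg_id: int, now: Optional[float] = None) -> None:
        self.pending[msg_id] = time.monotonic() if now is None else now

    def answer_received(self, msg_id: int) -> None:
        self.pending.pop(msg_id, None)

    def oldest_pending_age(self, now: float) -> float:
        """Age in seconds of the longest-waiting query, 0.0 if none."""
        if not self.pending:
            return 0.0
        return now - min(self.pending.values())

    def is_suspicious(self, now: float) -> bool:
        """True if some query has been waiting longer than the threshold."""
        return self.oldest_pending_age(now) > SUSPICIOUS_AFTER_S

    def timed_out_queries(self, now: float, timeout_s: float) -> List[int]:
        """Message IDs whose per-query timeout has expired."""
        return [mid for mid, sent in self.pending.items()
                if now - sent > timeout_s]


def pick_connection(conns: Iterable[TcpConnection], peer: str,
                    now: float) -> Optional[TcpConnection]:
    """Return a reusable connection to `peer`, skipping suspicious ones."""
    for conn in conns:
        if conn.peer == peer and not conn.is_suspicious(now):
            return conn
    return None  # caller would open a new connection or try another transport
```

When `pick_connection` returns `None`, the caller would either open a fresh connection or fall back to another transport; queries reported by `timed_out_queries` are the candidates for being resent. Whether the check should look at the oldest pending query (as here) or at the time since the last answer arrived on the connection is an open design choice.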
Related: #447 (closed)