rewrite server selection system
The current server selection mechanism is not well defined and sometimes exhibits hard-to-debug quirks. This ticket collects ideas about what we need from a proper server selection system.
Caveats
-------
- **look for existing literature on server selection!**
- **forwarding and iteration probably need different algorithms!**
- **what should be the overall criteria?** lowest RTT? reliability? lowest RTT when taking reliability into account? :-)
- can we map this to a multi-armed bandit (or some other) model from statistics?
- verify that it is okay to operate with *server == IP address* mapping
- multiple NS names can map to a single IP address
- NS names are probably not significant, properties could be associated with IP addresses
- think about unresolved NS names/incomplete glue
- consider lazy NS name -> IP address resolution if we have enough working servers
- what about anycast nodes with different properties? is it worth considering, or is it just an unsupported configuration? read related RFCs about anycast DNS operation
- server selection probably needs to include *transport protocol* selection for each IP address - UDP, TCP, TLS, DTLS, QUIC, DoH, ...
- some errors (REFUSED, SERVFAIL, ...) are not a property of an IP address but of an (IP address, zone) pair
- e.g. one lame delegation to a name server of a big web hosting company should not penalize the NS IP address as a whole
- transport protocols are likely to have different properties/statistics - RTT, reliability, etc.
- think about TLS-to-auth auto discovery
- how can we incorporate the https://tools.ietf.org/html/draft-ietf-dnsop-extended-error draft?
- properties can change over time so our stats need to expire
Ideas for attributes
====================
IP address
----------
- supported EDNS version (to avoid FORMERR loops, but maybe we need only per-query state ...)
- supported transport protocols (TLS configuration etc.)
- DNS cookies
(IP address, protocol)
----------------------
- RTT
- transport layer "reliability" (maybe timeouts should not be mixed with RTT ...)
- transport protocol information (cached TLS certificate, session resumption, 0-RTT data support, ...)
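
One possible shape for the per-(IP address, protocol) statistics, sketched under the assumption that timeouts are counted separately and never fed into the RTT figure. The names `TransportStats` and `transport_stats` are hypothetical, and a plain mean is used only to keep the example short.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TransportStats:
    answered: int = 0
    timed_out: int = 0
    rtt_sum: float = 0.0   # only answered queries contribute to RTT

    def on_answer(self, rtt):
        self.answered += 1
        self.rtt_sum += rtt

    def on_timeout(self):
        # A timeout affects reliability only; it does not distort the RTT.
        self.timed_out += 1

    @property
    def mean_rtt(self):
        return self.rtt_sum / self.answered if self.answered else None

    @property
    def loss_rate(self):
        total = self.answered + self.timed_out
        return self.timed_out / total if total else None


# Keyed by (IP address, protocol), e.g. ("192.0.2.1", "udp").
transport_stats = defaultdict(TransportStats)
```

Keeping the two figures apart lets a selector treat a fast but lossy transport differently from a slow but reliable one.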
(IP address, zone)
------------------
- usefulness - ok, SERVFAIL, REFUSED, BOGUS (lame delegations, expired zone data etc.)
Obviously, storing (server, zone) attributes might lead to state explosion. We need to think twice about this. Maybe there is a way to optimize, e.g. store only "broken" (server, zone) pairs so we can penalize these during server selection but do not bother with the vast majority of "working" pairs.
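
The "store only broken pairs" idea could look roughly like the sketch below: a table that holds an entry only for failing (server, zone) pairs and forgets it after a fixed lifetime, so that a pair absent from the table is assumed to work. The class name `BrokenPairs`, its methods, and the 900-second default are hypothetical.

```python
class BrokenPairs:
    """Negative-only store: pairs not present are assumed to be working."""

    def __init__(self, ttl=900.0):
        self.ttl = ttl
        self._entries = {}   # (address, zone) -> (reason, expiry time)

    def mark(self, address, zone, reason, now):
        # Record e.g. "REFUSED" or "SERVFAIL" for this pair only.
        self._entries[(address, zone)] = (reason, now + self.ttl)

    def reason(self, address, zone, now):
        """Return the recorded failure reason, or None if the pair is not known to be broken."""
        entry = self._entries.get((address, zone))
        if entry is None:
            return None
        reason, expires = entry
        if now >= expires:
            # Expired: drop the entry so the pair gets another chance.
            del self._entries[(address, zone)]
            return None
        return reason
```

With this layout a lame delegation for one zone penalizes only that pair, while the same address stays eligible for every other zone it serves.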
Assorted ideas
--------------
Serve stale
- timestamp of last attempt
- SERVFAIL and ok per server?
- counters for DoS mitigation (query per zone per server or ...)