rewrite server selection system
The current server selection mechanism is not well defined and sometimes exhibits hard-to-debug quirks. This is a ticket for collecting ideas about what we need from a proper server selection system.
Caveats
- look for existing literature about server selection!
- forwarding and iteration probably need different algorithms!
- what should be the overall criteria? lowest RTT? reliability? lowest RTT when taking reliability into account? :-)
  - can we map this to a multi-armed bandit (or some other) statistical model? (one possible mapping is sketched after this list)
- verify that it is okay to operate with a server == IP address mapping
  - multiple NS names can map to a single IP address
  - NS names are probably not significant; properties could be associated with IP addresses
  - think about unresolved NS names/incomplete glue
  - consider lazy NS name -> IP address resolution if we have enough working servers
  - what about anycast nodes with different properties? is it worth considering, or just an unsupported configuration? read related RFCs about anycast DNS operation
- server selection probably needs to include transport protocol selection for each IP address - UDP, TCP, TLS, DTLS, QUIC, DoH, ...
- some errors (REFUSED, SERVFAIL, ...) are not a property of an IP address but in fact a property of an (IP address, zone) pair
  - e.g. one lame delegation to a name server of a big web hosting company should not penalize the NS IP address as a whole
- transport protocols are likely to have different properties/statistics - RTT, reliability, etc.
- think about TLS-to-auth auto discovery
- how can we incorporate the https://tools.ietf.org/html/draft-ietf-dnsop-extended-error draft?
- properties can change over time, so our stats need to expire
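The overall-criteria and bandit questions above point at the same structure, so here is a rough Python sketch of one possible mapping: each server (or, per the transport note, each (address, transport) pair) is a bandit arm, the reward mixes latency and reliability, and all statistics decay so that they expire over time. Every name, constant, and the UCB1-style rule is an illustrative assumption, not an existing design.

```python
import math

# Illustrative sketch only: names, constants, and the UCB1-style rule
# are assumptions, not existing resolver code.

DECAY = 0.95        # per-update exponential decay, so stale stats expire
RTT_CEILING = 2.0   # seconds; anything slower (or a timeout) scores 0

class ArmStats:
    """Decayed reward statistics for one server (one bandit arm)."""
    def __init__(self):
        self.pulls = 0.0       # decayed count of attempts
        self.reward_sum = 0.0  # decayed sum of rewards in [0, 1]

    def update(self, rtt, success):
        # Reward mixes reliability and latency: a failure scores 0,
        # an instant answer scores close to 1.
        reward = (1.0 - min(rtt, RTT_CEILING) / RTT_CEILING) if success else 0.0
        self.pulls = self.pulls * DECAY + 1.0
        self.reward_sum = self.reward_sum * DECAY + reward

    def mean(self):
        return self.reward_sum / self.pulls if self.pulls else 0.0

def pick_server(stats, servers):
    """UCB1-style pick: exploit good servers, still probe unknown ones."""
    total = sum(arm.pulls for arm in stats.values()) or 1.0

    def score(server):
        arm = stats.get(server)
        if arm is None or arm.pulls < 1.0:
            return float("inf")   # never-tried server: always worth a probe
        return arm.mean() + math.sqrt(2.0 * math.log(total) / arm.pulls)

    return max(servers, key=score)
```

A resolver would call `stats.setdefault(server, ArmStats()).update(rtt, success)` after each answer or timeout and `pick_server(stats, servers)` before the next query; forwarding and iteration could simply keep separate `stats` tables, per the note above.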
Ideas for attributes
IP address
- supported EDNS version (to avoid FORMERR loops, but maybe we need only per-query state ...)
- supported transport protocols (TLS configuration etc.)
- DNS cookies
(IP address, protocol)
- RTT
- transport layer "reliability" (maybe timeouts should not be mixed with RTT ...)
- transport protocol information (cached TLS certificate, session resumption, 0-RTT data support, ...)
(IP address, zone)
- usefulness - ok, SERVFAIL, REFUSED, BOGUS (lame delegations, expired zone data, etc.)
Obviously, storing (server, zone) attributes might lead to state explosion, so we need to think twice about this. Maybe there is a way to optimize, e.g. store only "broken" (server, zone) pairs so we can penalize those during server selection but do not bother with the vast majority of "working" pairs; a minimal sketch of such a layout follows.
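To make that concrete, here is a minimal Python sketch of the attribute layout above: per-address, per-(address, protocol), and per-(address, zone) records, where the last kind is stored only for "broken" pairs and expires. All names, fields, and constants are hypothetical illustrations of the idea, not a proposed API.

```python
import time

BROKEN_TTL = 600  # seconds; assumed lifetime of a negative (address, zone) mark

class AddressInfo:
    """Attributes tied to the IP address itself."""
    def __init__(self):
        self.edns_version = None   # highest EDNS version known to work
        self.transports = set()    # e.g. {"udp", "tcp", "tls"}
        self.cookie = None         # DNS cookie, if any

class TransportInfo:
    """Attributes tied to an (IP address, protocol) pair."""
    def __init__(self):
        self.srtt = None           # smoothed RTT estimate (seconds)
        self.timeouts = 0          # deliberately kept separate from RTT
        self.tls_session = None    # cached session for resumption / 0-RTT

class AttributeStore:
    def __init__(self):
        self.addresses = {}    # ip -> AddressInfo
        self.transports = {}   # (ip, proto) -> TransportInfo
        self.broken = {}       # (ip, zone) -> expiry; "broken" pairs only

    def mark_broken(self, ip, zone):
        self.broken[(ip, zone)] = time.monotonic() + BROKEN_TTL

    def is_broken(self, ip, zone):
        expiry = self.broken.get((ip, zone))
        if expiry is None:
            return False
        if expiry < time.monotonic():
            del self.broken[(ip, zone)]   # stats expire over time
            return False
        return True
```

Keeping only the negative (ip, zone) entries bounds the table by the number of recently observed broken delegations rather than by |servers| × |zones|.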
Assorted ideas
Serve stale
- timestamp of last attempt
- SERVFAIL and OK per server?
- counters for DoS mitigation (queries per zone per server, or ...) - see the sketch below
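One possible reading of the DoS-mitigation counters, sketched in Python with assumed names and rates: a token bucket per (zone, server) that caps how often serve-stale logic re-probes a dead server, with the bucket timestamp doubling as the "timestamp of last attempt" from the list above.

```python
import time

PROBE_RATE = 1.0 / 30.0  # assumed: refill one probe per 30 s per (zone, server)
PROBE_BURST = 3.0        # assumed burst allowance

class ProbeLimiter:
    def __init__(self):
        self.buckets = {}  # (zone, server) -> (tokens, last attempt timestamp)

    def allow(self, zone, server):
        """Return True if another upstream probe is allowed right now."""
        now = time.monotonic()
        tokens, last = self.buckets.get((zone, server), (PROBE_BURST, now))
        tokens = min(PROBE_BURST, tokens + (now - last) * PROBE_RATE)
        if tokens < 1.0:
            self.buckets[(zone, server)] = (tokens, now)
            return False
        self.buckets[(zone, server)] = (tokens - 1.0, now)
        return True
```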