serve_stale module doesn't provide stale answers when auths are unresponsive
As of version 5.4.2, serve_stale
module doesn't work when auth servers are unresponsive (which is the typical case with network issues). The server selection algorithm tries very hard to resolve the request by re-trying different auth servers and increasing their allowed timeouts, until the request ultimately times out and returns SERVFAIL instead of a stale answer.
If the auth servers are reachable but REFUSE to respond, the serve_stale module works as expected (that was our former test case with deckard).
Some notes about possible resolution:
- to be useful for clients, the stale answer should be provided quickly enough (RFC 8767.5 suggests sending stale answer after 1.8s). The timeout used for serve_stale should ideally be configurable.
- the request resolution should keep going even after the stale answer is sent to the client to refresh data from slower auth severs (possible option: spawn a new duplicate internal request after providing the stale answer?)
- server selection should have a configurable time limit that is respected and allows serve_stale to activate in time
- the server selection time limit shouldn't be used unless serve_stale module is loaded and there is a possible stale answer in the cache