lib/selection: fix interaction of timeouts with reboots
We use "monotonic" timestamps for the dead_since field, which breaks across system reboots; in that case we reset the stats (if the server was categorized as dead).
If the server times out afterwards, we'd fail the condition
cur_state.consecutive_timeouts == old_state.consecutive_timeouts
so its stats would not update. Therefore we'd get stuck forever
in a state where the unusable server has high priority (no_rtt_info).
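
The failure mode above can be sketched in isolation. This is only an illustration under stated assumptions, not the actual lib/selection code: the struct and the helpers `view_for_selection`, `record_timeout` and `recorded_after_reboot` are hypothetical, and the sketch assumes that the reset is applied to the copy used for selection (`old_state`) while the cached entry still holds the pre-reboot counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-server stats. */
struct rtt_state_sketch {
	int consecutive_timeouts;
	uint64_t dead_since; /* monotonic timestamp; 0 means "not dead" */
};

/* Assumed reset rule: a dead_since that lies in the future can only come
 * from a previous boot, so the copy used for selection starts from scratch. */
static struct rtt_state_sketch view_for_selection(struct rtt_state_sketch cached,
						   uint64_t now)
{
	if (cached.dead_since > now) {
		struct rtt_state_sketch fresh = { 0, 0 };
		return fresh;
	}
	return cached;
}

/* Guard from the description: the timeout is recorded only if the counter
 * in the cache still equals the one seen when the server was selected. */
static bool record_timeout(struct rtt_state_sketch *cache_entry,
			   struct rtt_state_sketch old_state)
{
	struct rtt_state_sketch cur_state = *cache_entry;
	if (cur_state.consecutive_timeouts != old_state.consecutive_timeouts)
		return false; /* stats are left untouched */
	cur_state.consecutive_timeouts++;
	*cache_entry = cur_state;
	return true;
}

/* Simulate repeated timeouts of a server that was dead before a reboot;
 * returns how many of them were actually recorded. */
static int recorded_after_reboot(int attempts)
{
	struct rtt_state_sketch cache_entry = { 5, 1000000 }; /* written before reboot */
	uint64_t now = 10; /* the monotonic clock restarted near zero */
	int recorded = 0;
	for (int i = 0; i < attempts; i++) {
		struct rtt_state_sketch old_state =
			view_for_selection(cache_entry, now + i);
		if (record_timeout(&cache_entry, old_state))
			recorded++;
	}
	return recorded;
}
```

In this sketch every timeout after the reboot compares a reset counter (0) against the stale cached one (5), so nothing is ever written back and the server keeps looking like it has no RTT information.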
This commit changes a bit more than was necessary to fix this, including precision of the stats (in some cases).
Fixes #722 (closed)
Activity
changed milestone to %5.5.0
- Resolved by Štěpán Balážik
The "interaction" commit is the important one. The "randomness" commit could be dropped or delayed.
- Resolved by Tomas Krizek
CI: rp:fwd-tls6.udp-asan is fishy. I can't see how it could be affected by this MR, but the many retries of this pipeline and another on a recent commit would suggest a likely issue here.
added 7 commits
- 2a914dca...44ac0039 - 4 commits from branch master
- b9c2580e - lib/selection: improve randomness of ties
- 73df71b4 - lib/selection: fix interaction of timeouts with reboots
- 12dff1f9 - fixup! lib/selection: fix interaction of timeouts with reboots
marked this merge request as draft from 12dff1f9
added 1 commit
- 40f48534 - fixup! lib/selection: fix interaction of timeouts with reboots
- Resolved by Tomas Krizek
In respdiff we can see (example) that sometimes many names from a single NS set end up as SERVFAIL. According to the graphs, this happens more often than before this MR.
The cause is unclear to me. It's still only a tiny part of our query-set (like 1/1000 or smaller) and only a single NS set, so I would not block this MR on that, based on what I know at this moment.
mentioned in commit b21b33ba