Bootstrapping issue in 1.3
Hello KNOT folks,
We've found an issue with bootstrapping in 1.3. We're running FreeBSD 9.x, but we also quickly reproduced it on Ubuntu 12.x to confirm that it is not isolated to FreeBSD. We're testing with about 3000 to 4000 zones, so our environment is not even very large at this point, and the bootstrapping failures are very problematic. There are three causes that we've seen thus far:
-
If the AXFR TCP connect is interrupted by a signal, the whole AXFR is aborted and the bootstrap is rescheduled, instead of selecting on the socket until the connection either succeeds or times out/fails. This can result in a flood of connects, with little to no progress in the bootstrapping.
-
Once connected, if a recv() is interrupted by a signal (EINTR), it isn't retried. As a result, connections are dropped that don't need to be dropped.
-
If a successful connect is made, but the remote end subsequently drops it (e.g., resets the connection), then the bootstrap fails without being rescheduled. This was found when slaving from a non-KNOT DNS server that may have TCP rate limiting enabled, or something of that nature. Either way, the fact that it is not rescheduled is very undesirable.
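To illustrate what we'd expect for the first two cases, here is a minimal sketch in plain POSIX C. It is not taken from the KNOT sources: the function names (wait_writable, connect_eintr, recv_eintr) and the timeout parameter are made up for illustration, and the real fix may well look different. The idea is simply that an interrupted connect() keeps going in the background, so we wait for the socket to become writable and then check SO_ERROR, and that an interrupted recv() is just called again.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: wait until fd is writable or timeout_ms elapses.
 * poll() is restarted if a signal interrupts it. For simplicity the full
 * timeout is reused on each restart, so the total wait can exceed it.
 * Returns 1 if ready, 0 on timeout, -1 on error. */
int wait_writable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    int ret;
    do {
        ret = poll(&pfd, 1, timeout_ms);
    } while (ret < 0 && errno == EINTR);
    return ret;
}

/* Hypothetical helper: connect() that survives signal interruption.
 * On EINTR (or EINPROGRESS for a non-blocking socket) the connection
 * attempt continues asynchronously, so connect() is not called again.
 * Returns 0 on success, -1 on failure with errno set. */
int connect_eintr(int fd, const struct sockaddr *addr, socklen_t len,
                  int timeout_ms)
{
    if (connect(fd, addr, len) == 0) {
        return 0;
    }
    if (errno != EINTR && errno != EINPROGRESS) {
        return -1; /* Genuine failure, e.g. ECONNREFUSED. */
    }

    int ready = wait_writable(fd, timeout_ms);
    if (ready == 0) {
        errno = ETIMEDOUT;
    }
    if (ready <= 0) {
        return -1;
    }

    /* Writable only means the attempt finished; SO_ERROR says how. */
    int err = 0;
    socklen_t errlen = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0) {
        return -1;
    }
    if (err != 0) {
        errno = err;
        return -1;
    }
    return 0;
}

/* Hypothetical helper: recv() that is retried when interrupted by a signal.
 * Returns what recv() returns: bytes read, 0 on orderly close, -1 on error. */
ssize_t recv_eintr(int fd, void *buf, size_t len, int flags)
{
    ssize_t n;
    do {
        n = recv(fd, buf, len, flags);
    } while (n < 0 && errno == EINTR);
    return n;
}
```

With wrappers along these lines, a signal arriving mid-connect or mid-recv would no longer abort the transfer. The third case (remote reset after a successful connect) is a scheduling matter rather than a system call one, so it is not covered by this sketch.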
I suspect that there are other cases of interrupted system calls not being handled correctly.
Here is some additional info that may help find the root cause:
-
The greater the latency between the master and slave, the worse the problem is. We tested with a slave 80 ms RTT away and it was very bad.
-
The more worker threads you have, the worse the problem is. With enough worker threads, we could reproduce the issue fairly easily even locally (slave 0 ms away from master).
Hopefully this can be remedied!
Cheers,
Jonathan