Knot 2.3.1 fails to load a zone whose zone file is missing
I have found a subtle bug in Knot 2.3.1. Please follow the sequence below carefully, because the order of the steps matters. I start with a completely fresh Knot setup: my storage directory, /var/knot, is empty. I have 2 zones configured, and I start Knot:
2016-10-14T15:04:32 info: loading 2 zones
2016-10-14T15:04:32 info: [nro.net] zone will be bootstrapped
2016-10-14T15:04:32 info: [nro.org] zone will be bootstrapped
2016-10-14T15:04:32 info: starting server
2016-10-14T15:04:32 info: [nro.org] AXFR, incoming, 193.0.19.190@53: starting
2016-10-14T15:04:32 info: [nro.net] AXFR, incoming, 193.0.19.190@53: starting
2016-10-14T15:04:32 info: [nro.org] AXFR, incoming, 193.0.19.190@53: finished, serial 1460026934, 0.00 seconds, 1 messages, 2607 bytes
2016-10-14T15:04:32 info: [nro.net] AXFR, incoming, 193.0.19.190@53: finished, serial 1469614941, 0.00 seconds, 1 messages, 3947 bytes
2016-10-14T15:04:32 info: [nro.org] zone file updated, serial 1460026934
2016-10-14T15:04:32 info: [nro.net] zone file updated, serial 1469614941
2016-10-14T15:04:32 info: server started in the foreground, PID 20631
2016-10-14T15:04:32 info: control, binding to '/var/knot/knot.sock'
Note that Knot has bootstrapped 2 zones, AXFRed them, and written them into zone files. Now I stop Knot:
2016-10-14T15:05:32 info: stopping server
2016-10-14T15:05:32 info: updating zone timers database
2016-10-14T15:05:32 info: shutting down
Knot has updated the timers database, and shut down. I now have this in the storage dir:
# ls -l
total 20
-rw-rw---- 1 knot knot 6424 Oct 14 15:04 nro.net.zone
-rw-rw---- 1 knot knot 4203 Oct 14 15:04 nro.org.zone
drwxrwx--- 2 knot knot 4096 Oct 14 15:04 timers
Now, suppose I remove one of these zone files:
rm nro.org.zone
And then start Knot. Note the log carefully:
2016-10-14T15:07:50 info: Knot DNS 2.3.1 starting
2016-10-14T15:07:50 info: binding to interface '::1@53'
2016-10-14T15:07:50 info: binding to interface '127.0.0.1@53'
2016-10-14T15:07:50 info: binding to interface '2001:67c:2e8:11::c100:13bf@53'
2016-10-14T15:07:50 info: binding to interface '193.0.19.191@53'
2016-10-14T15:07:50 info: changing GID to '10073'
2016-10-14T15:07:50 info: changing UID to '10073'
2016-10-14T15:07:50 info: loading 2 zones
2016-10-14T15:07:50 info: [nro.net] zone will be loaded
2016-10-14T15:07:50 info: [nro.org] zone will be bootstrapped
2016-10-14T15:07:50 info: starting server
2016-10-14T15:07:50 info: [nro.net] loaded, serial 1469614941
2016-10-14T15:07:50 info: [nro.net] refresh, outgoing, 193.0.19.190@53: zone is up-to-date
2016-10-14T15:07:50 info: server started in the foreground, PID 21165
2016-10-14T15:07:50 info: control, binding to '/var/knot/knot.sock'
Knot has now started and loaded nro.net. It did not find the zone file for nro.org, so it marked that zone for bootstrap. However, Knot never contacts the master for nro.org: the log above shows no AXFR or refresh attempt for it, and a query for nro.org returns SERVFAIL. At this point Knot appears to be stuck. It never queries the master, never AXFRs the zone, and stays in a permanently broken state. I suspect the timers database is involved: if I stop Knot, delete the "timers" directory, and start Knot again, the nro.org zone DOES get loaded.
Therefore my conclusion so far is that this subtle bug has to do with the combination of the timers database and bootstrapping a zone.
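
To make the symptom easier to spot in a longer log, here is a minimal sketch of a checker that lists zones which were marked "zone will be bootstrapped" but for which no "AXFR, incoming" line ever follows. It is not part of Knot: the function name stuck_bootstrap_zones is hypothetical, and the line format it parses is assumed from the log excerpts in this report only.

```python
import re

# Assumed line format, taken from the log excerpts in this report, e.g.:
#   2016-10-14T15:07:50 info: [nro.org] zone will be bootstrapped
ZONE_LINE_RE = re.compile(
    r"^\S+\s+\w+:\s+\[(?P<zone>[^\]]+)\]\s+(?P<msg>.*)$"
)


def stuck_bootstrap_zones(log_lines):
    """Return a sorted list of zones that were marked for bootstrap
    but never show an incoming AXFR afterwards (hypothetical helper)."""
    pending = set()
    for line in log_lines:
        match = ZONE_LINE_RE.match(line.strip())
        if match is None:
            # Not a per-zone line (e.g. "loading 2 zones"); ignore it.
            continue
        zone = match.group("zone")
        msg = match.group("msg")
        if msg == "zone will be bootstrapped":
            pending.add(zone)
        elif msg.startswith("AXFR, incoming"):
            # Any incoming AXFR line means the master was contacted.
            pending.discard(zone)
    return sorted(pending)


if __name__ == "__main__":
    import sys

    # Usage: python check_bootstrap.py < knot.log
    for zone in stuck_bootstrap_zones(sys.stdin):
        print(zone)
```

Run against the second startup log above, this should report nro.org only; run against the first startup log it should report nothing, since both zones show an incoming AXFR.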