Knot 1.6.4 failing to refresh a zone
One of our Knot 1.6.4 servers failed to refresh the ripe.net zone earlier today, and we got alerts. It fixed itself eventually, but we don't know what happened. Here's the sequence:
2015-06-30T08:14:09 info: [ripe.net] NOTIFY, incoming, 193.0.19.190@2049: received serial 1435652042
2015-06-30T08:14:09 info: [ripe.net] refresh, outgoing, 193.0.19.190@53: master has newer serial 1435590662 -> 1435652042
2015-06-30T08:14:09 info: [ripe.net] IXFR, incoming, 193.0.19.190@53: starting
2015-06-30T08:14:10 info: [ripe.net] IXFR, incoming, 193.0.19.190@53: finished, 0.68 seconds, 80 messages, 4805321 bytes
2015-06-30T08:14:21 info: [ripe.net] NOTIFY, incoming, 93.175.159.250@18318: received serial 1435652042
2015-06-30T08:14:21 info: [ripe.net] refresh, outgoing, 93.175.159.250@53: zone is up-to-date
2015-06-30T09:05:17 info: [ripe.net] zone file updated, serial 1435590662 -> 1435652042
2015-06-30T09:18:09 info: [ripe.net] NOTIFY, incoming, 193.0.19.190@2049: received serial 1435655882
2015-06-30T09:18:24 info: [ripe.net] NOTIFY, incoming, 93.175.159.250@4707: received serial 1435655882
2015-06-30T09:41:39 info: remote control, received command 'refresh ripe.net.'
2015-06-30T09:59:42 info: remote control, received command 'zonestatus'
2015-06-30T10:03:01 info: [ripe.net] NOTIFY, incoming, 193.0.19.190@2049: received serial 1435655883
2015-06-30T10:05:04 info: [ripe.net] NOTIFY, incoming, 93.175.159.250@53845: received serial 1435655883
2015-06-30T10:05:08 info: [ripe.net] refresh, outgoing, 93.175.159.250@53: master has newer serial 1435652042 -> 1435655883
2015-06-30T10:05:08 info: [ripe.net] IXFR, incoming, 93.175.159.250@53: starting
2015-06-30T10:05:08 notice: [ripe.net] journal is full, flushing
2015-06-30T10:05:08 warning: [ripe.net] IXFR, incoming, 93.175.159.250@53: failed to write changes to journal (requested resource is busy)
2015-06-30T10:05:09 notice: [ripe.net] IXFR, incoming, 93.175.159.250@53: fallback to AXFR
2015-06-30T10:05:09 info: [ripe.net] AXFR, incoming, 93.175.159.250@53: starting
2015-06-30T10:05:09 info: [ripe.net] AXFR, incoming, 93.175.159.250@53: finished, serial 1435652042 -> 1435655883, 0.18 seconds, 52 messages, 2936689 bytes
2015-06-30T10:05:17 info: [ripe.net] zone file updated, serial 1435652042 -> 1435655883
Now note this: there was a pair of NOTIFYs, at 09:18:09 and 09:18:24, but Knot did not refresh the zone. This is when we got the alert that ripe.net was out of sync. My colleague ran "knotc refresh ripe.net" at 09:41:39, but it did not fix the problem. When I ran "knotc zonestatus" at 09:59:42 to check, I noticed the zone was still in the "refresh pending" state.
# knotc zonestatus |grep ^ripe.net
ripe.net. type=slave | serial=1435652042 | refresh pending | DNSSEC signing disabled
Knot just sat like this and did not refresh the zone, even though it was happily refreshing many other zones.
Eventually, a new NOTIFY came in at 10:03:01 and somehow woke Knot up, and it refreshed ripe.net at 10:05:08. So we no longer have the error condition. However, Knot appears to have gotten stuck on, or forgotten about, refreshing ripe.net for nearly an hour.
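In case it helps anyone look for the same pattern in their own logs, here is a rough Python sketch that flags incoming NOTIFYs which are not followed by a "refresh, outgoing" line for the same zone within some window. It assumes the log format shown above; the function name and the 5-minute default window are my own arbitrary choices, not anything Knot defines.

```python
import re
from datetime import datetime, timedelta

# Assumes lines shaped like the log excerpt above, e.g.
# 2015-06-30T09:18:09 info: [ripe.net] NOTIFY, incoming, 193.0.19.190@2049: received serial 1435655882
LINE_RE = re.compile(r"^(\S+) \w+: \[([^\]]+)\] (.*)$")
NOTIFY_RE = re.compile(r"NOTIFY, incoming, .*received serial (\d+)")
REFRESH_RE = re.compile(r"refresh, outgoing, ")


def find_unanswered_notifies(lines, max_delay=timedelta(minutes=5)):
    """Return (zone, timestamp, serial) for each NOTIFY that has no
    'refresh, outgoing' line for the same zone within max_delay.

    The name and the default window are hypothetical, for illustration only.
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skips lines without a [zone] tag, e.g. remote control
        ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
        zone, msg = m.group(2), m.group(3)
        n = NOTIFY_RE.search(msg)
        if n:
            events.append((ts, zone, "notify", int(n.group(1))))
        elif REFRESH_RE.match(msg):
            events.append((ts, zone, "refresh", None))

    stalled = []
    for i, (ts, zone, kind, serial) in enumerate(events):
        if kind != "notify":
            continue
        # Look only at later log lines, in file order.
        answered = any(
            k == "refresh" and z == zone and ts <= t <= ts + max_delay
            for t, z, k, _ in events[i + 1:]
        )
        if not answered:
            stalled.append((zone, ts, serial))
    return stalled
```

Run over the excerpt above with the default window, this should report only the two NOTIFYs for serial 1435655882 at 09:18:09 and 09:18:24, which matches when the alert fired.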
Any ideas?