IXFR crashes Knot DNS 3.2.8
This morning, we updated all our server to Knot DNS 3.2.8. Soon after, a bunch of them crashed simultaneously. I don't know if the crash is related to the update, or would have happened anyway, but here are some details. These servers are all secondaries for a zone 19.139.in-addr.arpa. RIPE NCC is not the primary for this zone. It is a customer zone. We AXFR it from the customer to our BIND XFR server, and then propagate it to our publication servers. Below, you can see the recent update history of the zone:
2023-06-27T09:36:50+0000 info: [19.139.in-addr.arpa.] notify, incoming, remote 2001:67c:2e8:11::c100:13be@57916, serial 1687858608
2023-06-27T09:36:50+0000 info: [19.139.in-addr.arpa.] refresh, remote 2001:67c:2e8:11::c100:13be@53, remote serial 1687858608, zone is outdated
2023-06-27T09:36:50+0000 info: [19.139.in-addr.arpa.] IXFR, incoming, remote 2001:67c:2e8:11::c100:13be@53, started
2023-06-27T09:36:50+0000 info: [19.139.in-addr.arpa.] IXFR, incoming, remote 2001:67c:2e8:11::c100:13be@53, finished, 0.00 seconds, 1 messages, 325 bytes
2023-06-27T09:36:50+0000 info: [19.139.in-addr.arpa.] refresh, remote 2001:67c:2e8:11::c100:13be@53, zone updated, 0.01 seconds, serial 1687851933 -> 1687858608, expires in 3600000 seconds
2023-06-27T09:36:51+0000 info: [19.139.in-addr.arpa.] notify, incoming, remote 2001:67c:2d7c:66::53@48060, serial 1687858608
2023-06-27T10:07:50+0000 info: [19.139.in-addr.arpa.] notify, incoming, remote 2001:67c:2e8:11::c100:13be@43429, serial 1687860468
2023-06-27T10:07:50+0000 info: [19.139.in-addr.arpa.] refresh, remote 2001:67c:2e8:11::c100:13be@53, remote serial 1687860468, zone is outdated
2023-06-27T10:07:50+0000 info: [19.139.in-addr.arpa.] IXFR, incoming, remote 2001:67c:2e8:11::c100:13be@53, started
2023-06-27T10:07:50+0000 info: [19.139.in-addr.arpa.] IXFR, incoming, remote 2001:67c:2e8:11::c100:13be@53, finished, 0.00 seconds, 1 messages, 324 bytes
2023-06-27T10:07:51+0000 info: [19.139.in-addr.arpa.] refresh, remote 2001:67c:2e8:11::c100:13be@53, zone updated, 0.09 seconds, serial 1687858608 -> 1687860468, expires in 3600000 seconds
2023-06-27T10:07:51+0000 info: [19.139.in-addr.arpa.] notify, incoming, remote 2001:67c:2d7c:66::53@54797, serial 1687860468
2023-06-27T10:14:31+0000 info: [19.139.in-addr.arpa.] notify, incoming, remote 2001:67c:2e8:11::c100:13be@56393, serial 1687860869
2023-06-27T10:14:31+0000 info: [19.139.in-addr.arpa.] refresh, remote 2001:67c:2e8:11::c100:13be@53, remote serial 1687860869, zone is outdated
2023-06-27T10:14:31+0000 info: [19.139.in-addr.arpa.] IXFR, incoming, remote 2001:67c:2e8:11::c100:13be@53, started
2023-06-27T10:14:31+0000 info: [19.139.in-addr.arpa.] IXFR, incoming, remote 2001:67c:2e8:11::c100:13be@53, finished, 0.00 seconds, 1 messages, 288 bytes
2023-06-27T10:14:31+0000 info: [19.139.in-addr.arpa.] refresh, remote 2001:67c:2e8:11::c100:13be@53, zone updated, 0.01 seconds, serial 1687860468 -> 1687860869, expires in 3600000 seconds
2023-06-27T10:14:32+0000 info: [19.139.in-addr.arpa.] notify, incoming, remote 2001:67c:2d7c:66::53@36426, serial 1687860869
2023-06-27T11:05:16+0000 info: [19.139.in-addr.arpa.] notify, incoming, remote 2001:67c:2e8:11::c100:13be@34685, serial 1687863911
2023-06-27T11:05:16+0000 info: [19.139.in-addr.arpa.] refresh, remote 2001:67c:2e8:11::c100:13be@53, remote serial 1687863911, zone is outdated
2023-06-27T11:05:16+0000 info: [19.139.in-addr.arpa.] IXFR, incoming, remote 2001:67c:2e8:11::c100:13be@53, started
2023-06-27T11:05:16+0000 info: [19.139.in-addr.arpa.] IXFR, incoming, remote 2001:67c:2e8:11::c100:13be@53, finished, 0.12 seconds, 224 messages, 3907575 bytes
At this point, the server crashes, and this is what systemd reports:
# systemctl status knot
● knot.service - Knot DNS Server
Loaded: loaded (/usr/lib/systemd/system/knot.service; enabled; vendor preset: disabled)
Active: failed (Result: signal) since Tue 2023-06-27 11:05:19 UTC; 1h 32min ago
Process: 151750 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS)
Process: 121137 ExecStart=/usr/sbin/knotd (code=killed, signal=SEGV)
Main PID: 121137 (code=killed, signal=SEGV)
The interesting thing to note is that the latest IXFR is quite large.
When I started Knot again, it happily loaded the problem zone, and began serving it with the latest serial:
2023-06-27T12:04:08+0000 info: [19.139.in-addr.arpa.] zone will be loaded
2023-06-27T12:04:10+0000 info: [19.139.in-addr.arpa.] zone loaded from journal, serial 1687860869
2023-06-27T12:04:10+0000 info: [19.139.in-addr.arpa.] loaded, serial none -> 1687860869, 859762 bytes, expires in 3584485 seconds
Before I restart the rest of the failed servers, I'd like to gather as much data as possible for you to debug. We are not in any emergency situation, because we have other servers fulfilling the capacity.
Just for comparison, our NSD servers in the cluster saw this:
[2023-06-27 11:05:46.990] nsd[9129]: info: zone 19.139.in-addr.arpa. received update to serial 1687863911 at 2023-06-27T11:05:21 from 2001:67c:2e8:11::c100:13be TSIG verified with key main.ripe.net of 3904849 bytes in 0.125703 seconds
[2023-06-27 11:06:08.721] nsd[7825]: info: zone 19.139.in-addr.arpa. serial 1687860869 is updated to 1687863911
So they received a large XFR, and seem to have processed it without any problem.