crash after AXFR
knot-2.7.2 crashed twice after an AXFR. The logs give very little context, and the two crashes happened at distinct locations, which (if confirmed) could suggest a memory issue. I don't have backtraces, and this specific server is a master, so I cannot gather more information on it with gdb, for example.
First crash (note negative time after "finished"):
Sep 18 10:13:02 knot[2730]: info: refresh, outgoing, xxx@53: remote serial 2018091802, zone is outdated
Sep 18 10:13:02 knot[2730]: info: IXFR, incoming, xxx@53: receiving AXFR-style IXFR
Sep 18 10:13:02 knot[2730]: info: AXFR, incoming, xxx@53: starting
Sep 18 10:13:02 knot[2730]: info: AXFR, incoming, xxx@53: finished, -7717964.48 seconds, 1 messages, 320 bytes
Sep 18 10:13:02 knot[2730]: info: refresh, outgoing, xxx@53: zone updated, serial 2018091801 -> 2018091802
Sep 18 10:13:02 knot[2730]: info: zone file updated, serial 2018091801 -> 2018091802
Sep 18 10:13:02 kernel: [7717836.767581] traps: knotd[2751] general protection ip:7f2c3452a5a6 sp:7f2c0c963cb0 error:0 in liburcu.so.6.0.0[7f2c34527000+6000]
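The libc trap (offset 0x94f57 in libc-2.27.so, from the second crash log further below) seems to be inside glibc's malloc; the faulting instruction is marked with `*`: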
94f1b: 48 8b 44 24 08 mov 0x8(%rsp),%rax
94f20: 48 39 05 a9 f3 34 00 cmp %rax,0x34f3a9(%rip) # 3e42d0 <mp_+0x50>
94f27: 0f 86 ff fe ff ff jbe 94e2c <__libc_malloc+0x6c>
94f2d: 64 48 8b 4d 00 mov %fs:0x0(%rbp),%rcx
94f32: 48 85 c9 test %rcx,%rcx
94f35: 0f 84 f1 fe ff ff je 94e2c <__libc_malloc+0x6c>
94f3b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
94f40: 48 8d 34 c1 lea (%rcx,%rax,8),%rsi
94f44: 48 8b 56 40 mov 0x40(%rsi),%rdx
94f48: 48 85 d2 test %rdx,%rdx
94f4b: 0f 84 db fe ff ff je 94e2c <__libc_malloc+0x6c>
94f51: 48 83 f8 3f cmp $0x3f,%rax
94f55: 77 19 ja 94f70 <__libc_malloc+0x1b0>
* 94f57: 48 8b 3a mov (%rdx),%rdi
94f5a: 48 89 7e 40 mov %rdi,0x40(%rsi)
94f5e: 80 2c 01 01 subb $0x1,(%rcx,%rax,1)
94f62: 48 83 c4 18 add $0x18,%rsp
94f66: 48 89 d0 mov %rdx,%rax
94f69: 5b pop %rbx
94f6a: 5d pop %rbp
94f6b: c3 retq
Second crash:
Sep 18 16:38:53 knot[18092]: info: control, received command 'zone-retransfer'
Sep 18 16:38:54 knot[18092]: info: AXFR, incoming, xxx@53: starting
Sep 18 16:38:54 kernel: [7740988.047323] traps: knotd[18099] general protection ip:7fa190ec0f57 sp:7fa16a6fb290 error:0 in libc-2.27.so[7fa190e2c000+1e0000]
Just after this crash, I noticed that we didn't have symbols for uRCU, so I recompiled the exact same release with debugging symbols. This is not absolutely certain to yield the same layout, but in my experience it reproduces well enough as long as all the tools involved are the same (i.e., libs, compilers, etc.).
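The offsets quoted in this report can be recovered from the kernel trap lines by subtracting the mapping base (the number before the `+`) from the faulting ip. A minimal sketch in Python; the helper name `trap_offset` is hypothetical, and it assumes the reported mapping starts at file offset 0, so that the result can be compared directly with objdump addresses:

```python
import re

# Matches the kernel "traps:" lines quoted in this report.
TRAP_RE = re.compile(
    r"ip:(?P<ip>[0-9a-f]+) sp:[0-9a-f]+ error:\d+ "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def trap_offset(line):
    """Return (library, offset of the faulting ip inside its mapping)."""
    m = TRAP_RE.search(line)
    if m is None:
        raise ValueError("not a kernel trap line")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    size = int(m["size"], 16)
    # Sanity check: the ip must fall inside the mapping the kernel reported.
    if not base <= ip < base + size:
        raise ValueError("ip is outside the reported mapping")
    return m["lib"], ip - base


# The two trap lines from the logs in this report.
first = ("traps: knotd[2751] general protection ip:7f2c3452a5a6 "
         "sp:7f2c0c963cb0 error:0 in liburcu.so.6.0.0[7f2c34527000+6000]")
second = ("traps: knotd[18099] general protection ip:7fa190ec0f57 "
          "sp:7fa16a6fb290 error:0 in libc-2.27.so[7fa190e2c000+1e0000]")

for line in (first, second):
    lib, off = trap_offset(line)
    print(f"{lib}: {off:#x}")
```

This prints `liburcu.so.6.0.0: 0x35a6` and `libc-2.27.so: 0x94f57`, i.e. the first crash maps to uRCU and the second one to glibc.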
So here is a possible location of the first crash (the liburcu trap), in call_rcu_thread at offset 0x35a6; the faulting instruction is marked with `*`:
3558: e8 53 e6 ff ff callq 1bb0 <synchronize_rcu_memb@plt>
355d: 48 8b 44 24 30 mov 0x30(%rsp),%rax
3562: 48 85 c0 test %rax,%rax
3565: 0f 84 bd 00 00 00 je 3628 <call_rcu_thread+0x2a8>
356b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
3570: 31 c0 xor %eax,%eax
3572: 48 8b 4c 24 30 mov 0x30(%rsp),%rcx
3577: 48 85 c9 test %rcx,%rcx
357a: 75 18 jne 3594 <call_rcu_thread+0x214>
357c: 83 c0 01 add $0x1,%eax
357f: 83 f8 09 cmp $0x9,%eax
3582: 0f 8f b8 00 00 00 jg 3640 <call_rcu_thread+0x2c0>
3588: f3 90 pause
358a: 48 8b 4c 24 30 mov 0x30(%rsp),%rcx
358f: 48 85 c9 test %rcx,%rcx
3592: 74 e8 je 357c <call_rcu_thread+0x1fc>
3594: 4c 8b 39 mov (%rcx),%r15
3597: 4d 85 ff test %r15,%r15
359a: 0f 84 b0 01 00 00 je 3750 <call_rcu_thread+0x3d0>
35a0: 45 31 e4 xor %r12d,%r12d
35a3: 48 89 cf mov %rcx,%rdi
* 35a6: ff 51 08 callq *0x8(%rcx)
35a9: 49 83 c4 01 add $0x1,%r12
35ad: 4d 85 ff test %r15,%r15
35b0: 0f 84 ea fe ff ff je 34a0 <call_rcu_thread+0x120>
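One possible reading of the marked instruction `callq *0x8(%rcx)`, offered as a sketch and not as a confirmed diagnosis: if liburcu's `struct rcu_head` is a one-pointer queue node followed by the callback pointer (my assumption from memory of the uRCU headers, not checked against 0.10.1), then the call is `rhp->func(rhp)` and `func` sits at offset 8. A "general protection" trap, as opposed to a plain segfault, is commonly what x86-64 raises for a non-canonical address, which would fit a corrupted callback pointer. The Python below only models that layout and the canonical-address test; `RcuHead` and `is_canonical` are illustrative names, and 48-bit virtual addresses are assumed.

```python
import ctypes


class RcuHead(ctypes.Structure):
    """Assumed x86-64 layout of liburcu's struct rcu_head (not verified)."""
    _fields_ = [
        ("next", ctypes.c_uint64),  # queue node: a single 8-byte pointer
        ("func", ctypes.c_uint64),  # callback pointer: void (*func)(struct rcu_head *)
    ]


def is_canonical(addr):
    """True if addr is canonical on x86-64 with 48-bit virtual addresses.

    Bits 63..47 must be all zeros or all ones.
    """
    top = addr >> 47
    return top == 0 or top == 0x1FFFF


# Offset of the callback pointer; matches the 0x8 displacement in the call.
print(RcuHead.func.offset)

# The faulting ip from the first trap is an ordinary user-space address.
print(is_canonical(0x7F2C3452A5A6))

# A made-up garbage value of the kind a stale or overwritten pointer may hold.
print(is_canonical(0x4141414141414141))
```

The same reasoning would apply to `mov (%rdx),%rdi` in the malloc listing, where `%rdx` was loaded from an allocator-internal list just before being dereferenced.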
Is there any more information I could provide? Knot is running on Slackware64-14.2 with updates:
- glibc-2.27
- uRCU-0.10.1
- lmdb-0.9.14
- libmaxminddb-1.0.2
- libedit-20170329_3.1
- Linux kernel 4.15.15
Nothing special was used in Knot's configure; only directories were set to accommodate Slackware's hierarchy.