SIGBUS on ARM
@dkg wrote: fwiw, i think we're having a problem just running the armhf (32-bit arm with hard-float) build of knot-resolver on top of an arm64 kernel (despite the kernel otherwise running fine with an entirely 32-bit userland). you can see the build logs for knot-resolver on armhf -- the machine named arm-arm-01 is an arm64 kernel and armhf userland, and the test suite was fully re-enabled on all platforms in version 3.0.0-4.
Activity
- Author Owner
I think our code may do some unaligned reads/writes. That's OK on x86 IIRC (never causes a problem), but perhaps on some ARM machines it is bad. It's also breakage of C99, I know, it's on TODO.
- Vladimír Čunát mentioned in issue #216 (closed)
- Guest
As a data point, this version of knot-resolver built successfully in Ubuntu:
https://launchpad.net/ubuntu/+source/knot-resolver/3.1.0-1/+build/15694916
Since Ubuntu already builds all its armhf packages on systems running arm64 kernels (and thus have strict handling of alignment issues), if this same source is SIGBUSing in Debian but not Ubuntu it's likely an alignment problem in a build-dependency that Ubuntu has patched.
- Guest
here's a backtrace from amdahl.debian.org in an armhf chroot with an arm64 kernel:
Program received signal SIGBUS, Bus error.
0xf77497f8 in get_new_ttl (entry=0xf051c962, entry@entry=0xfffef1a0, qry=0x74aa04, qry@entry=0x7498d8, owner=0x74aaff "", type=2, type@entry=0, now=1543619701, now@entry=7645700) at lib/cache/api.c:233
233 int32_t diff = now - entry->time;
(gdb) bt
#0 0xf77497f8 in get_new_ttl (entry=0xf051c962, entry@entry=0xfffef1a0, qry=0x74aa04, qry@entry=0x7498d8, owner=0x74aaff "", type=2, type@entry=0, now=1543619701, now@entry=7645700) at lib/cache/api.c:233
#1 0xf774f21c in check_NS_entry (k=<optimized out>, k=<optimized out>, timestamp=<optimized out>, qry=<optimized out>, is_DS=<optimized out>, exact_match=<optimized out>, i=2, entry=...) at lib/cache/peek.c:703
#2 closest_NS (k=k@entry=0xfffea5b8, el=0xfffea50c, el@entry=0x0, qry=<optimized out>, qry@entry=0x74aa04, only_NS=<optimized out>, only_NS@entry=false, is_DS=false, cache=<optimized out>, cache=<optimized out>) at lib/cache/peek.c:637
#3 0xf774f762 in peek_nosync (ctx=0xf75e3400, ctx@entry=0xfffea9b8, pkt=0xf6db0041, pkt@entry=0x7498d8) at lib/cache/peek.c:159
#4 0xf774a5b8 in cache_peek (ctx=0xfffea9b8, pkt=0x7498d8) at lib/cache/api.c:338
#5 0xf775b4c2 in kr_resolve_produce (request=request@entry=0x749798, dst=0x0, dst@entry=0x7498c0, type=0x409ba7 <qr_task_step+154>, type@entry=0xfffeaa24, packet=0x0) at lib/resolve.c:1384
#6 0x00409ba6 in qr_task_step (task=task@entry=0x749890, packet_source=packet_source@entry=0x0, packet=packet@entry=0x700028) at daemon/worker.c:1372
#7 0x0040a734 in worker_resolve_exec (task=task@entry=0x749890, query=query@entry=0x700028) at daemon/worker.c:1697
#8 0x0040cbd4 in wrk_resolve (L=0xf6db21c0) at daemon/bindings.c:1670
#9 0xf75a5abc in ?? () from /usr/lib/arm-linux-gnueabihf/libluajit-5.1.so.2
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(this is from debian sources, version 3.1.0-1)
- Guest
I'm happy to debug further in this environment. let me know if you want me to try any specific tests.
the backtrace above comes from running kresd as configured in debian/tests/roundtrip
- Author Owner
Ah, yes, this is struct entry_h starting at an unaligned address (% 4 == 2 here, but it can even be odd on my machine). I've known about it for months, but until now I hadn't seen any case where it caused trouble, so it's been a matter of priorities.
I think I've seen ubsan catching these problems, so I expect to be able to use that on x86 instead of these ARMs for further testing when fixing this.
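The "% 4 == 2" remark can be checked against the entry=0xf051c962 address in frame #0 of the backtrace. Below is a minimal sketch in C of that check. The helper name misalign_of is hypothetical and not part of knot-resolver. It assumes the alignment is a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from knot-resolver): returns how many bytes
 * `ptr` lies past the previous multiple of `align`.
 * 0 means the pointer is suitably aligned.
 * Assumption: `align` is a nonzero power of two. */
static size_t misalign_of(const void *ptr, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (size_t)((uintptr_t)ptr & (align - 1));
}
```

With the address from the backtrace, misalign_of((const void *)(uintptr_t)0xf051c962u, 4) gives 2, so a 4-byte load of entry->time is misaligned there. GCC and Clang both provide -fsanitize=alignment, which is probably the kind of ubsan check meant above for reproducing this on x86.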
- Guest
This is causing more trouble in other unrelated packages now.
- Guest
is there anything i can do to move this along? i'd love to see it fixed if possible.
- Author Owner
I think we have everything for this except manpower/time. Or too many other priorities relative to that, I might say. And I actually think #216 (closed) is more significant than this issue, though I find it difficult to estimate the overall number of people affected by such SIGBUSes.
- Guest
I agree that #216 (closed) is more significant, in that it blocks an entire architecture, not just an architecture+kernel combination.
- Contributor
Hopefully this was fixed in 4.1.0, feel free to reopen if necessary.
- Petr Špaček closed
- Author Owner
Nothing around this was fixed, I believe, except for the other "more problematic" ARM issue.
My brain dump before I forget completely: so far I know only about the issue that data in LMDB is generally not aligned. For entry_h fields themselves there are multiple ways to work around this, e.g. by reading them more carefully via memcpy wrappers. For .data we'd need 2-alignment due to libknot functions, IIRC. Keys of even length should imply 2-alignment for values in LMDB, but that's undocumented, and perhaps it would be better to just allocate slightly longer values and pad their start as necessary.
- Vladimír Čunát reopened
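The two workarounds from the brain dump above (memcpy wrappers for entry_h fields, and padding the start of a value) can be sketched in C as follows. This is an illustration under stated assumptions, not the fix that was eventually merged. struct entry_sketch is a hypothetical stand-in for the real struct entry_h and models only the time field seen in the backtrace, with an assumed type and offset. The names entry_time_at and pad_for_align are likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct entry_h; the real layout may differ. */
struct entry_sketch {
	uint32_t time; /* assumed type of the field read in get_new_ttl() */
};

/* Read the `time` field from a possibly unaligned address.
 * memcpy makes no alignment assumption about its source, so this avoids
 * the misaligned 4-byte load that a direct entry->time access performs. */
static uint32_t entry_time_at(const void *raw)
{
	uint32_t t;
	memcpy(&t, (const uint8_t *)raw + offsetof(struct entry_sketch, time),
	       sizeof(t));
	return t;
}

/* Number of padding bytes to skip so that `start + pad` is a multiple of
 * `align`. Assumption: `align` is a nonzero power of two. */
static size_t pad_for_align(const void *start, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (size_t)((0 - (uintptr_t)start) & (align - 1));
}
```

For the padding approach, the value would be allocated align - 1 bytes longer than needed, and the payload would begin at start + pad_for_align(start, 2) to get the 2-alignment mentioned for .data.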
- Contributor
@vcunat Is this still relevant? My last walk-through of the cache code indicated that alignment is being done in kresd code... but I might misremember this.
- Author Owner
It's unchanged, but apparently affected hardware is rather rare.
Dear kresd developers,
Debian armhf CI tests are failing due to a SIGBUS (they run on an arm64 kernel): https://ci.debian.net/packages/k/knot-resolver/ See e.g.: https://ci.debian.net/data/autopkgtest/unstable/armhf/k/knot-resolver/11411012/log.gz Unfortunately, these tests are required to make knot-resolver migrate to testing and then to the next stable distribution.
I understand that you have limited manpower, but do you have any plans to fix this in the short term?
- Author Owner
I forgot to cross-link; I think it should be fixed by !1167 (merged)
- Vladimír Čunát mentioned in merge request !1167 (merged)
Awesome, thanks! Do you have an approximate date to release 5.3.2?