SIGBUS on ARM

I think our code may do some unaligned reads/writes. IIRC that never causes a problem on x86, but it can be a problem on some ARM machines. I know it's also a violation of C99; it's on the TODO list.

mentioned in issue #216 (closed)

Steve McIntyre writes:

As a data point, this version of knot-resolver built successfully in Ubuntu:

https://launchpad.net/ubuntu/+source/knot-resolver/3.1.0-1/+build/15694916

Since Ubuntu already builds all its armhf packages on systems running arm64 kernels (and thus have strict handling of alignment issues), if this same source is SIGBUSing in Debian but not Ubuntu it's likely an alignment problem in a build-dependency that Ubuntu has patched.

Here's a backtrace from amdahl.debian.org in an armhf chroot with an arm64 kernel:

Program received signal SIGBUS, Bus error.
0xf77497f8 in get_new_ttl (entry=0xf051c962, entry@entry=0xfffef1a0, qry=0x74aa04, qry@entry=0x7498d8, owner=0x74aaff "", type=2, type@entry=0, now=1543619701, now@entry=7645700) at lib/cache/api.c:233
233		int32_t diff = now - entry->time;
(gdb) bt
#0  0xf77497f8 in get_new_ttl (entry=0xf051c962, entry@entry=0xfffef1a0, qry=0x74aa04, qry@entry=0x7498d8, owner=0x74aaff "", type=2, type@entry=0, now=1543619701, now@entry=7645700) at lib/cache/api.c:233
#1  0xf774f21c in check_NS_entry (k=<optimized out>, k=<optimized out>, timestamp=<optimized out>, qry=<optimized out>, is_DS=<optimized out>, exact_match=<optimized out>, i=2, entry=...) at lib/cache/peek.c:703
#2  closest_NS (k=k@entry=0xfffea5b8, el=0xfffea50c, el@entry=0x0, qry=<optimized out>, qry@entry=0x74aa04, only_NS=<optimized out>, only_NS@entry=false, is_DS=false, cache=<optimized out>, cache=<optimized out>)
    at lib/cache/peek.c:637
#3  0xf774f762 in peek_nosync (ctx=0xf75e3400, ctx@entry=0xfffea9b8, pkt=0xf6db0041, pkt@entry=0x7498d8) at lib/cache/peek.c:159
#4  0xf774a5b8 in cache_peek (ctx=0xfffea9b8, pkt=0x7498d8) at lib/cache/api.c:338
#5  0xf775b4c2 in kr_resolve_produce (request=request@entry=0x749798, dst=0x0, dst@entry=0x7498c0, type=0x409ba7 <qr_task_step+154>, type@entry=0xfffeaa24, packet=0x0) at lib/resolve.c:1384
#6  0x00409ba6 in qr_task_step (task=task@entry=0x749890, packet_source=packet_source@entry=0x0, packet=packet@entry=0x700028) at daemon/worker.c:1372
#7  0x0040a734 in worker_resolve_exec (task=task@entry=0x749890, query=query@entry=0x700028) at daemon/worker.c:1697
#8  0x0040cbd4 in wrk_resolve (L=0xf6db21c0) at daemon/bindings.c:1670
#9  0xf75a5abc in ?? () from /usr/lib/arm-linux-gnueabihf/libluajit-5.1.so.2
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(this is from debian sources, version 3.1.0-1)

I'm happy to debug further in this environment. let me know if you want me to try any specific tests.

the backtrace above comes from running kresd as configured in debian/tests/roundtrip

Ah, yes, this is struct entry_h starting at an unaligned address (address % 4 == 2 here, but it can even be odd on my machine). I've known about it for months, but until now I hadn't seen any case where it caused trouble, so it's been a matter of priorities.
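To make the "% 4 == 2" observation concrete, here is a minimal sketch of an alignment check. The helper name is_aligned is hypothetical and not part of the kresd sources; the address used in the comment is the entry pointer shown in frame #0 of the backtrace above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (an assumption, not kresd code): report whether
 * pointer p sits on a multiple of `align` bytes. */
static inline bool is_aligned(const void *p, size_t align)
{
	return ((uintptr_t)p % align) == 0;
}

/* For the backtrace's entry=0xf051c962: the address is a multiple of 2
 * but not of 4, so a 4-byte field at its start is misaligned. */
```

A check like this could be dropped into an assertion while hunting for the affected code paths, under the assumption that 4-byte alignment is what the faulting load requires.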

I think I've seen ubsan catching these problems, so I expect to be able to use that on x86 instead of these ARMs – for further testing when fixing this.

This is causing more trouble in other unrelated packages now.

is there anything i can do to move this along? i'd love to see it fixed if possible.

I think we have everything for this except manpower/time. Or rather, too many other priorities relative to that. And I actually think #216 (closed) is more significant than this issue, though I find it difficult to estimate the overall number of people affected by such SIGBUSes.

I agree that #216 (closed) is more significant, in that it blocks an entire architecture, not just an architecture+kernel combination.

Hopefully this was fixed in 4.1.0, feel free to reopen if necessary.

closed

Nothing around this was fixed, I believe, except for the other "more problematic" ARM issue.

My brain dump before I forget completely: so far I know only about the issue that data in LMDB is generally not aligned. For entry_h fields themselves there are multiple ways to work around, e.g. by reading these more carefully via memcpy wrappers. For .data we'd need 2-alignment due to libknot functions IIRC. Keys of even length should imply 2-alignment for values in LMDB, but that's undocumented and perhaps it will be better to just allocate a bit longer values and pad their start as necessary.
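As an illustration of the "memcpy wrappers" idea mentioned above, here is a minimal sketch. The function names are hypothetical and this is not the code that ended up in kresd; it only shows the general technique of copying bytes into a properly aligned local instead of dereferencing a possibly misaligned pointer.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (an assumption, not kresd code): read a uint32_t
 * from an address that may not be 4-byte aligned. memcpy has no
 * alignment requirement on its arguments, so this avoids the
 * misaligned load that a direct dereference could produce. */
static inline uint32_t read_u32_unaligned(const void *p)
{
	uint32_t v;
	memcpy(&v, p, sizeof(v));
	return v;
}

/* Hypothetical counterpart for writes to a possibly misaligned address. */
static inline void write_u32_unaligned(void *p, uint32_t v)
{
	memcpy(p, &v, sizeof(v));
}
```

Under this approach, a line like the one at lib/cache/api.c:233 would read the time field through such a wrapper rather than via entry->time. Compilers typically turn these small fixed-size memcpy calls into plain loads and stores where the target allows it, so the cost on x86 should be negligible, though that is an expectation rather than something measured here.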

reopened

@vcunat Is this still relevant? My last walk-through of the cache code indicated that alignment is being done in kresd code... but I might misremember this.

It's unchanged, but apparently affected hardware is rather rare.

Dear kresd developers,

Debian armhf CI tests are failing due to a SIGBUS (they run on an arm64 kernel): https://ci.debian.net/packages/k/knot-resolver/
See e.g.: https://ci.debian.net/data/autopkgtest/unstable/armhf/k/knot-resolver/11411012/log.gz
Unfortunately, these tests are required for knot-resolver to migrate to testing and then to the next stable distribution.

I understand that you have limited manpower, but do you have any plans to fix this in the short term?

I forgot to cross-link; I think it should be fixed by !1167 (merged)

mentioned in merge request !1167 (merged)

Awesome, thanks! Do you have an approximate date to release 5.3.2?
