manager: systemd: improved handling of subprocess crashes

added manager label

assigned to @vsraier

deleted the manager-tooling-fixes branch. This merge request now targets the manager branch

added 9 commits

aba661cd...e70ca7ac - 8 commits from branch manager
e5857cf7 - manager: systemd: better error messages and restart subprocesses always

marked this merge request as ready

@tkrizek @vcunat Following our discussion in #711 (closed), I've replaced Restart=on-abnormal with Restart=always. The reasoning behind it is in the code, but shortly - the manager simply assumes that kresd is running at all times. Setting restart always makes sure it stays true. With this patch, manager shouldn't crash unless kresd crashes repeatedly.

Do you agree with this? If yes, please approve this MR and I will merge it.

From a packaging/systemd perspective, I wouldn't recommend using Restart=always for reasons mentioned in #711 (closed).

But if you're absolutely sure that's the right thing to do when using kresd under the manager component, feel free to ahead. I don't really know the internals of manager.

Just beware that is there's ever any bug in the generated configuration, the kresd instance will be restarted over and over again (until it hits the limit, I believe). Then the service log will be polluted with multiple crashes of the same kind in rapid succession.

Manager assumes the invariant, that the kresd instances never stop unless requested. If there is ever an event when any instance stops, manager unpredictably explodes at unspecified time in the future. I am sadly not aware of any simple and reliable way how to detect unit failures (the best option would probably be setting OnFailure=, but that has it's own set of problems and race conditions).

I've decided for Restart=always because the only downside I am aware of is the repeated log. But in practice, clean exits should never happen under manager and if they do, it's technically an error.

Just beware that is there's ever any bug in the generated configuration, the kresd instance will be restarted over and over again (until it hits the limit, I believe).

Doesn't kresd terminate with a non-zero exit code when there is an error with configuration? I haven't tested it, but I would expect it. And if it's true, then there is no difference between on-failure and always in this case. We would have to use on-abnormal, which as @vcunat discovered is surprising even to knowledgeable people, because it leads to unexpected crashes of manager.

I hope this clears it up a bit. I am quite convinced that Restart=always is the choice making manager most robust for the price of repeated logs, but I think that's just a minor acceptable issue.

Doesn't kresd terminate with a non-zero exit code when there is an error with configuration?

Yes, it should. Consequently I don't expect that the difference between on-abnormal and always would be "interesting" to us. Maybe the only case was me calling quit() by accident.

approved this merge request

I read it before rebase, and looked OK.

merged

mentioned in commit a2c339a5

mentioned in issue #711 (closed)

mentioned in commit 7803cc5b

mentioned in commit 9e185e8a

Admin message

manager: systemd: improved handling of subprocess crashes

Merge request reports

Activity