Failed worker notifications

Vaclav Sraier requested to merge failed_worker_notifications into master

When a worker (kresd or gc) fails by itself, we should detect it and react somehow. The simplest reaction we can implement is to log an error and kill the manager. It's also the safest option, so this MR attempts to do just that. The idea is as follows:

  • Extend the SubprocessController interface with a register_instability_handler function. The manager would then install a callback into the subprocess controller after its creation.
  • The controller would start a watchdog thread or monitor running workers in some other way. When something goes wrong, it would call the given callback.
  • In case of instability, the manager will kill everything. In the future, we could change it so that the manager uses the existing API in the controller to get the current state of the system and tries to fix it, so that the last configuration is followed. If something weird happens again, kill everything.
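
A minimal sketch of how the callback registration described in the list could look. Only the names SubprocessController and register_instability_handler come from this description; the handler signature, the DummyController class, the Manager class, and its running flag are hypothetical and exist only to illustrate the flow, not the actual implementation in this MR.

```python
import logging
from abc import ABC, abstractmethod
from typing import Callable, Optional

logger = logging.getLogger(__name__)

# Assumption: the handler receives the id of the worker that failed.
InstabilityHandler = Callable[[str], None]


class SubprocessController(ABC):
    @abstractmethod
    def register_instability_handler(self, handler: InstabilityHandler) -> None:
        """Install a callback that is invoked when a worker fails by itself."""


class DummyController(SubprocessController):
    """Hypothetical controller used only to demonstrate the callback flow."""

    def __init__(self) -> None:
        self._handler: Optional[InstabilityHandler] = None

    def register_instability_handler(self, handler: InstabilityHandler) -> None:
        self._handler = handler

    def simulate_worker_failure(self, worker_id: str) -> None:
        # A real controller would call the handler from its watchdog.
        if self._handler is not None:
            self._handler(worker_id)


class Manager:
    """Hypothetical manager: logs an error and shuts down on instability."""

    def __init__(self, controller: SubprocessController) -> None:
        self.running = True
        # The callback is installed right after the controller is created.
        controller.register_instability_handler(self._on_instability)

    def _on_instability(self, worker_id: str) -> None:
        logger.error("Worker %s failed unexpectedly, stopping everything", worker_id)
        self.running = False
```

Keeping the reaction inside a single callback means a later version could swap "kill everything" for a repair attempt without touching the controllers.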

^ This functionality must be implemented for both supported service managers. systemd supports notifications via D-Bus, but we must spawn a separate thread to receive them. supervisord, AFAIK, does not support notifications, so we must poll its state (but we should verify that there is no better way).
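
For the polling variant, a watchdog thread could look roughly like the sketch below. It deliberately does not call any real supervisord API: get_states is a hypothetical callable that returns a mapping from worker name to an "is alive" flag, and PollingWatchdog, on_instability, and the interval parameter are likewise assumptions made for illustration.

```python
import threading
from typing import Callable, Dict, List


class PollingWatchdog(threading.Thread):
    """Periodically polls worker states and reports failed workers once."""

    def __init__(
        self,
        get_states: Callable[[], Dict[str, bool]],   # hypothetical state query
        on_instability: Callable[[List[str]], None],  # hypothetical callback
        interval: float = 1.0,
    ) -> None:
        super().__init__(daemon=True)
        self._get_states = get_states
        self._on_instability = on_instability
        self._interval = interval
        self._stop_event = threading.Event()

    def run(self) -> None:
        # Event.wait() returns False on timeout and True once stop() was called,
        # so the loop sleeps between polls and exits promptly on shutdown.
        while not self._stop_event.wait(self._interval):
            states = self._get_states()
            failed = sorted(name for name, alive in states.items() if not alive)
            if failed:
                self._on_instability(failed)
                return  # report once; the manager decides what happens next

    def stop(self) -> None:
        self._stop_event.set()
```

The watchdog stops after the first report because the planned reaction is to kill everything; a repair-oriented manager would probably keep it running instead.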

Edited by Vaclav Sraier