can't reliably fetch stats when using SO_REUSEPORT
I'm using Knot Resolver with systemd, and I want to use the stats module together with the http module to fetch stats in Prometheus format.
My problem is that if I start more than one instance (kresd@1, kresd@2, …), stats-fetching requests are distributed among the instances, and each request returns only the stats of the instance that answered it.
I can't get a reliable way to fetch the stats in such configuration.
Workaround:
I can fetch and aggregate individual worker stats from the control sockets, but the control socket is very unreliable: it cannot properly parse two successive queries and often tries to interpret them as a single one.
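A minimal Python sketch of that workaround, sending one command per connection to sidestep the delimiting problem. The socket path and the `[key] => value` output format of `stats.list()` are assumptions here, not confirmed specifics:

```python
import socket

def parse_stats(text):
    """Parse lines like '[answer.total] => 123' into a dict (format assumed)."""
    stats = {}
    for line in text.splitlines():
        if "=>" in line:
            key, _, value = line.partition("=>")
            stats[key.strip().strip("[]")] = int(value.strip())
    return stats

def fetch_stats(socket_path):
    """Fetch stats over a kresd control socket, one command per connection.

    Opening a fresh connection per command avoids the two-queries-in-one-buffer
    parsing problem entirely. Whether the daemon closes the connection after
    replying depends on the version, so this is a sketch, not a guarantee.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(socket_path)
        s.sendall(b"stats.list()\n")
        s.shutdown(socket.SHUT_WR)  # signal that no more input follows
        data = b""
        while chunk := s.recv(4096):
            data += chunk
    return parse_stats(data.decode())
```

Aggregation across workers would then happen client-side, per instance.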
Hi @jddupas and welcome. You are right that SO_REUSEPORT makes the HTTP endpoint unsuitable for gathering stats; we will think about it. In the meantime, let's have a look at the control socket.
Please post the code you use to gather stats from the control socket, and we will investigate where the problem could be.
@amrazek This is something to consider in your future work.
Sometimes, on the knot-resolver side, libuv reports only one event containing both queries (that is, it calls tty_process_input with a single buffer containing "__binary\nstats.list()\n").
Unfortunately, tty_process_input strongly assumes that a buffer contains exactly one whole query, which looks rather unsafe given how low-level I/O routines have the bad habit of rarely truncating buffers where we expect them to.
I tried to mitigate it by disabling socket buffering and adding a flush on the Python side, but I'm not sure these settings are properly honored.
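The fix the parser would need is newline-delimited framing: split whatever a single read delivers into complete commands and keep the trailing partial command for the next read. A minimal illustrative sketch in Python (not kresd's actual C code):

```python
def split_commands(buffer: bytes):
    """Split a read buffer into complete newline-terminated commands.

    A single read may deliver several commands at once
    (e.g. b"__binary\nstats.list()\n") or only part of one; the
    incomplete tail is returned as the remainder to prepend to the
    next read.
    """
    *complete, remainder = buffer.split(b"\n")
    return [c for c in complete if c], remainder
```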
Right. The prometheus module only tries to handle this when multiple instances are started via the -f parameter, not when they are started separately (which is what our systemd packages do).
Thanks for finding this problem with the delimitation of commands in the control socket.
Here is a workaround: do it the other way around and have each instance write its stats out elsewhere, e.g. to a per-instance file, and then read them from there. Yes, it is ugly, sorry about that!
Example Lua code suitable for kresd config:
```lua
modules.load('stats')

-- determine own pid
local statfile = io.open('/proc/self/stat')
local mypid = tostring(statfile:read('*n'))
statfile:close()
statfile = nil

-- write current stats
local function write_stats()
  statfile = io.open('/tmp/stats.' .. mypid, 'w')
  statfile:write(table_print(stats.list()))
  statfile:close()
end

-- write stats every 1 second
event.recurrent(1000, write_stats)
```
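A collector could then pick up those per-instance files and keep the values separate per pid rather than summing them. A sketch in Python, assuming the `/tmp/stats.<pid>` paths from the Lua snippet above and a `[key] => value` line format for `table_print` output (both assumptions):

```python
import glob
import re

def read_instance_stats(pattern="/tmp/stats.*"):
    """Read each per-instance stats file, keyed by the pid in the filename.

    Values are deliberately kept per instance instead of being summed,
    so each instance can become its own labeled Prometheus series.
    """
    per_instance = {}
    for path in glob.glob(pattern):
        pid = path.rsplit(".", 1)[-1]
        stats = {}
        with open(path) as f:
            for line in f:
                m = re.match(r"\[(.+?)\]\s*=>\s*(\d+)", line)
                if m:
                    stats[m.group(1)] = int(m.group(2))
        per_instance[pid] = stats
    return per_instance
```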
We are starting a project to overhaul the user interface, so you can look forward to a brighter future... but it will take a couple of months or so. Thank you for your patience!
Thanks for the workaround. I finally managed to get a somewhat reliable solution by adding a sleep() between the two commands in my Python code.
While working on this, I discovered an important point that you may want to take into account:
When exporting to Prometheus (at least), you can't aggregate the stats of multiple workers simply by adding all the values and creating series from the sums.
Prometheus expects counters to be monotonic, and it can detect and handle counter resets. But if you aggregate all workers' stats into a single counter and one worker restarts, the sum decreases without dropping to zero: monotonicity is broken, yet it does not look like a proper counter reset, which may result in inconsistent series in Prometheus.
For instance, instead of creating a single answer_total series containing the sum of all worker stats:
answer_total 5208
you have to create a series per worker (using labels) like this:
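For example, in Prometheus exposition format (the label name and the per-worker values here are illustrative; only their sum matches the figure above):

```
answer_total{instance="kresd1"} 2604
answer_total{instance="kresd2"} 2604
```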
For reference, the issue with receiving multiple commands in one syscall should be solved since release 5.1.1: https://gitlab.labs.nic.cz/knot/knot-resolver/-/merge_requests/991. Newlines now work as command separators. Sending very long "strings" over a single connection still probably won't be reliable, but I don't think your use case is anywhere near that.