Knot stuck with too many TCP connections in CLOSE_WAIT state
One of our production servers has failed to refresh several zones today. In the logs, I see lots of messages like this:
2015-02-05T04:56:16 error: throttling TCP connection pool for 14 seconds, too many allocated resources
2015-02-05T04:56:16 error: cannot accept connection (24)
I next ran "netstat -pan", and saw lots and lots of TCP connections to knotd, stuck in the CLOSE_WAIT state. These TCP connections are from various addresses, both IPv4 and IPv6. I don't know if this is a deliberate attack of some kind, or just a bug that prevents knotd from freeing up the sockets after the TCP connection has ended. Anyway, this appears to prevent knotd from creating outbound TCP connections to our master to do zone transfers. When I traced the process, I saw:
# strace -p 6379
Process 6379 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
rt_sigprocmask(SIG_UNBLOCK, [], NULL, 8) = 0
pselect6(516, [515], NULL, NULL, NULL, {NULL, 8}) = 1 (in [515])
rt_sigprocmask(SIG_BLOCK, [], NULL, 8) = 0
accept(515, 0, NULL) = -1 EMFILE (Too many open files)
write(525, "2015-02-05T10:25:28 error: canno"..., 57) = 57
write(525, "2015-02-05T10:25:28 error: throt"..., 103) = 103
The result of this is that many zones are now stale. (The "(24)" in the log message is errno 24, EMFILE, which matches the trace above.) I think I can probably recover from this by just restarting knot, but I want to file this report in case there is a bug in knot that could hold sockets open for too long.
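For the record, here is how the descriptor exhaustion can be confirmed against the per-process limit through procfs (6379 is the knotd PID from the trace above; paths assume Linux):

# grep 'open files' /proc/6379/limits
# ls /proc/6379/fd | wc -l

If the second number is at the limit, restarting knotd clears the CLOSE_WAIT sockets; raising the limit would only postpone the problem if the descriptors are genuinely being leaked.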