client retry logic on TCP/TLS connection closure

Please add a link to the discussion so we have all the pointers in one place.

It was there but "hidden" as the FRITZ! link: https://forum.turris.cz/t/dns-over-tcp-just-a-single-transaction/12003/11

Okay, here comes the Wireshark log: pkgupdate.pcapng.gz

Steps to Reproduce

FRITZ!Box 7590 with FRITZ!OS 07.12 as router, DHCP server, and DNS resolver (default behavior)
Turris MOX with Turris OS 4.0.5 in (host) computer mode, DHCP client (default behavior)
DNSSEC disabled for simplicity
$ ssh root@turris.local
$ pkgupdate

The SSH console gives:

INFO:Target Turris OS: 4.0.5
line not found
line not found
line not found
ERROR:
runtime: [string "requests"]:395: [string "utils"]:427: URI download failed: Couldn't resolve host 'repo.turris.cz'

FRITZ!OS accepts only one DNS transaction per TCP connection. FRITZ!OS closes the TCP connection after 15 seconds. AVM ‘confirmed’ the issue but rejected support for several DNS transactions in one TCP connection as a feature request.

I looked closely at the first TCP flow, starting at the packet numbered "8" in Wireshark, and the behavior of FRITZ! (192.168.0.1) seems really weird to me. It answers the first two queries basically immediately (an A and AAAA pair for the same name), and then there comes a looong period (~15 seconds) where it ACKs all the incoming queries but never replies... and after that long time it closes the connection.

I mean, if they don't want to answer anything anymore, why keep the connection open? Even if we improved our side, there's an "unavoidable" large delay due to them keeping it open, as we can have no idea that they won't reply.

Yep, that is the problem. It looks like AVM expects the querier to close the TCP connection. But then, as you state, how should a querier (one that supports multiple queries per connection) know that? dig on the command line sits there for those 15 seconds and cannot do anything else.

Sorry, I don't really see what a standards-compliant client could do (at least without previous knowledge, probably hand-configured). https://tools.ietf.org/html/rfc7766 says that both sides should support query pipelining...

Can you post a link to the relevant ticket from AVM?

Thank you!

Perhaps point them to the part saying

DNS servers (especially recursive) MUST expect to receive pipelined queries.
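For reference, pipelining here just means writing several length-prefixed DNS messages onto one TCP connection without waiting for each answer. A minimal Python sketch of that framing with two example queries follows; the helper names `build_query`, `frame` and `split_frames` are made up for illustration and are not taken from kresd or any other resolver.

```python
import struct

A = 1  # DNS record type A

def build_query(name, qtype, txid):
    """Build a minimal DNS query: header with RD set, one question, class IN."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)

def frame(message):
    """Prefix a DNS message with the two-byte length field used over TCP."""
    return struct.pack("!H", len(message)) + message

def split_frames(buffer):
    """Split a TCP byte stream into complete DNS messages plus leftover bytes."""
    messages = []
    while len(buffer) >= 2:
        (length,) = struct.unpack("!H", buffer[:2])
        if len(buffer) < 2 + length:
            break  # the last message has not fully arrived yet
        messages.append(buffer[2:2 + length])
        buffer = buffer[2 + length:]
    return messages, buffer

# Two queries written back to back, as a pipelining client would send them.
pipelined = (
    frame(build_query("example.com", A, 0x1001))
    + frame(build_query("example.net", A, 0x1002))
)
```

A server that "expects pipelined queries" has to be prepared for a stream like `pipelined`: the second frame can arrive before the first answer is written, and answers are matched to queries by transaction ID rather than by order.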

From my point of view and knowledge, there is no alternative to never timing out. That is what dig does, so everyone should do it. Some day the other party, here AVM, will move.

AVM does not have a public bug tracker. Anyway, the ID is 3301056. They simply rejected it as a feature request. They suspected a rate limiter within FRITZ!OS. However, I have no clue how that could kick in when even things like
dig @fritz.box +short +tcp +keepopen example.com A example.net A
fail.

@vcunat I am not sure I understand that part. A super-section states SHOULD (for Connection Reuse), a sub-section states MUST (for Query Pipelining). What is that about? If a super-section states SHOULD, nothing below it can be a MUST. Puh. I am really too stupid for RFCs …

My understanding is that the server MUST expect that clients will attempt pipelining and SHOULD support answering all requests that come that way – in particular, servers SHOULD NOT close the connection right after the first answer (that part is very old, actually). It's not clear what exactly this "expect" means about the behavior, so I suppose FRITZ! might still say they're compliant-ish; they apparently don't care about this feature.

From my point of view and knowledge, there is no alternative to never timing out. That is what dig does, so everyone should do it. Some day the other party, here AVM, will move.

What do you mean? I'm confused.

In any case, "never timeout" is not an option when it comes to anything on TCP. It would open a path to attacks like SYN flood, SlowLoris etc.
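Just to illustrate what bounded retry logic on connection closure could look like (a rough sketch, not how kresd works today): start with pipelining, and once a connection ends with queries still unanswered, re-send only those, one query per new connection. Everything here is hypothetical, including the function name `resolve_with_retry` and the `exchange` callable, which stands for "open a connection, send this batch, and return the set of query IDs answered before the connection closed or our own deadline expired".

```python
def resolve_with_retry(query_ids, exchange, max_connections=4):
    """Return (answered, still_pending) after at most max_connections attempts.

    query_ids: list of identifiers for the queries to send.
    exchange:  hypothetical callable taking a list of query IDs and returning
               the set of IDs that were answered on that one connection.
    """
    pending = list(query_ids)
    answered = set()
    batch_size = len(pending)  # optimistic: pipeline everything first
    connections = 0
    while pending and connections < max_connections:
        batch = pending[:batch_size]
        got = exchange(batch)
        connections += 1
        answered |= got
        pending = [q for q in pending if q not in answered]
        if len(got) < len(batch):
            # The peer dropped part of the batch: stop pipelining and
            # fall back to one query per connection for the rest.
            batch_size = 1
    return answered, pending


def first_only(batch):
    """Fake peer that answers only the first query on each connection."""
    return {batch[0]}


answered, pending = resolve_with_retry([1, 2, 3], first_only)
```

With the fake `first_only` peer, all three queries get answered over three connections. The cap on connections keeps the client from retrying forever, but the sketch does nothing about the real cost seen in the capture: each failed batch still costs whatever deadline `exchange` applies, since the client cannot know earlier that no reply is coming.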

Guys, I did my job by reporting and tracking down the cause as far as I could. Now it is your job to find a solution. That is not my job, especially because I do not understand how a TCP client can be attacked. Anyway, just to re-emphasise the severity: because of this, Turris OS is not able to update! This is not a special, rare configuration but the default of both Turris OS and FRITZ!OS. Finally, when kresd waits these 15 seconds (it does wait already), all subsequent queries for the same domain within those 15 seconds fail (they return a good-looking empty answer). That is why the update script fails so severely: it asks for repo.turris.cz several times in a row.

I am still tracking down why kresd is switching to TCP, and whether that is also something specific to the combination with FRITZ!OS. Until then, this issue affects a vast majority in Germany, as the FRITZ!Box is the IAD here.

@traud: well, yes, you're in the unfortunate position between these two implementations whose positions (honestly) seem dead-locked in a state that this configuration just won't work.

My practical recommendation is... just avoid this setup, i.e. is there a reason why you need Turris to forward DNS to the FRITZ!Box? In my opinion the most natural mode of operation is to not use forwarding at all (one click in the Foris GUI), and there are also a few other easy options to forward to some public services (usually secured by TLS).

@traud Do you remember when it started happening? I doubt you are the only Turris OS user in Germany so I want to find out what changed to see if we can fix that.

One more thing: Please provide model number + firmware version so we have enough information when talking to AVM. Thank you for your time and patience!

added needinfo label

I think a significant fraction of Omnias is in Germany. For quick reference, in the original crowdfunding campaign it was ~14% of the money.

avoid this set up

Sure … you are telling the wrong person. I could not care less because I ‘solved’ this issue for myself long ago. However, it took me a lot of effort to find the root cause (DNS) and a workaround. There are a lot of other approaches, like changing the default in Turris OS, adding more verbose logs, or even removing that CNAME/A to proxy.turris.cz and just using an A record. Again, it is not my job to dictate a solution. If you do not care either, that’s it. All I can do is offer my help if you need more testing or have follow-up questions.

model number + firmware

FRITZ!Box 7590 FRITZ!OS 07.12
FRITZ!Box 7490 FRITZ!OS 07.19-76429 (that is the current head, a beta version)

Both are mentioned in the ticket which AVM has. If you like, I can check other current branches of FRITZ!OS like 06.8x.

when it started

Day 1. In January, I got this Turris MOX used. It had never been unpacked. It was still on Turris OS 4.0; I am not sure if the web interface printed the exact version back then. If you need the serial number to track which version it shipped with, I can provide that. Actually, on the forum, a lot of other users face the same symptom. Again, it is just the symptom; we do not know if it is the same cause. However, I confirmed with this user that his Turris MOX failed to update and that he is using a FRITZ!Box as well. Actually, he did not notice that his Turris MOX was not at the latest version and still created a report. Go figure!

By the way, the rescue modes use the resolver of the FRITZ!Box as well. Although those modes should be affected too, at least rescue mode 6 is not trapped by this; it actually never switches to TCP.

removed needinfo label

@traud Sorry for the delay, I'm trying to reach the responsible AVM engineer to find a systemic solution.

Thank you so much for your time investment, it is very much appreciated.

@traud Hi! I was told that the FRITZ!Box Labor version (07.19-78839) has improvements for DNS over TCP. Could you test this version by any chance?

Thanks.

Thank you for keeping track! Did AVM tell you this? Interesting: they notified me neither as the original ticket creator nor via the release notes.

Anyway, yes, both issues are solved by that update (multiple TCP queries and the caching behavior in case of query-case randomization). Because of the latter, Turris OS is not going to run into DNS over TCP at all. Nevertheless, this is something to watch out for, because other DNS implementations might behave similarly. And, worse, many FRITZ!Boxes are not receiving this major update. It is questionable whether AVM is going to backport this to their older FRITZ!OS branches 07.1x, 07.0x, 06.8x, 06.5x, and 06.3x, which still get security updates (full list). AVM is not that transparent and behaves similarly to Microsoft with their Windows feature upgrades.

Actually, even I am affected by this policy because my (normal) main FRITZ!Box (and my spare/backup box) will not get that major update. Consequently, my out-of-box experience with Turris OS would not change. Luckily, we know a zillion workarounds for Turris OS now, like disabling the query-case randomization, choosing DNS over TLS, or going for a new FRITZ!Box …
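For readers unfamiliar with query-case randomization: the client mixes upper and lower case in the query name and checks that the reply echoes the name with exactly the same casing. The thread does not spell out how the FRITZ!OS cache interfered with this, so the Python sketch below only illustrates the check itself, under the assumption that a reply with different casing counts as a mismatch. The function names are invented for illustration.

```python
import random

def randomize_case(name, rng):
    """Return the name with each letter randomly upper- or lower-cased."""
    return "".join(
        ch.upper() if ch.isalpha() and rng.random() < 0.5 else ch.lower()
        for ch in name
    )

def same_name(a, b):
    """DNS names compare case-insensitively; ignore a trailing dot."""
    return a.rstrip(".").lower() == b.rstrip(".").lower()

def echoes_case(sent, received):
    """Stricter check: the reply must repeat the exact casing that was sent."""
    return sent == received

sent = randomize_case("repo.turris.cz", random.Random())
```

A reply carrying `repo.turris.cz` in plain lower case would pass `same_name` but fail `echoes_case` for almost every randomized `sent`, which is the kind of mismatch a resolver has to react to somehow, for example by retrying differently.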

mentioned in issue #629
