Query not matched with the answer (non-monotone PCAP)
It seems that dns-collector, when matching a DNS query with a DNS answer, relies on the order of packets in the PCAP file rather than on the packet timestamps. This approach results in severe data inconsistencies in the Impala database (many NULL values).
The problem is described below using a query for the domain name beardmens.cz observed in the PCAP for the dns-s-02 server. Although the timestamps indicate that the query was observed first (before the answer), under certain circumstances the two packets may be stored in the PCAP file in reverse order:
$ tshark -t e -r dns-s-02.20180618.075500.008305 'dns.qry.name=="beardmens.cz"'
1568 1529308502.421889 188.8.131.52 → 184.108.40.206 DNS 774 Standard query response 0x9172 No such name NS beardmens.cz NSEC3 RRSIG NSEC3 RRSIG NSEC3 RRSIG SOA a.ns.nic.cz RRSIG OPT
1645 1529308502.421780 220.127.116.11 → 18.104.22.168 DNS 69 Standard query 0x9172 NS beardmens.cz OPT
For such input, dns-collector produces two separate rows in the CSV file (an unmatched query plus an unmatched answer), which is not correct:
$ grep beardmens.cz preproc/dns-s-02/dns-s-02.20180618.075500.008305.csv
1529308502.421889|||746||774|22.214.171.124|33572|126.96.36.199|53|17|4|64||37234|2|1|0|3|1|0||0||0||beardmens.cz.|0|1|8||||||\N|\N|\N|\N|\N|\N
1529308502.421780||41||69||188.8.131.52|33572|184.108.40.206|53|17|4|54|15390|37234|2|1|0||||0||0||1|beardmens.cz.||||0|4096|1|||\N|\N|\N|\N|\N|
For comparison, the pcap-to-parquet tool provided with the entrada system produces the proper output (the query matched with the answer).
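
To illustrate the behaviour I would expect, here is a rough Python sketch of order-tolerant matching. It is not how dns-collector or entrada is implemented. The `Packet` type, the `match_packets` function, the `max_delay` parameter and the choice of matching key (client address, client port, server address, DNS ID, query name) are my own assumptions, and the IP addresses are placeholders from the documentation ranges. Only the timestamps, port 33572 and ID 0x9172 come from the capture above. The idea is to also keep unmatched responses pending, so that a query arriving later in the file can still be paired with a response that has a later timestamp.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Packet:
    """Hypothetical, simplified view of one DNS packet."""
    ts: float
    client_ip: str
    client_port: int
    server_ip: str
    dns_id: int
    qname: str
    is_response: bool


Key = Tuple[str, int, str, int, str]


def match_key(p: Packet) -> Key:
    # Assumed matching key; a real collector may use more fields.
    return (p.client_ip, p.client_port, p.server_ip, p.dns_id, p.qname.lower())


def match_packets(
    packets: List[Packet], max_delay: float = 1.0
) -> Tuple[List[Tuple[Packet, Packet]], List[Packet]]:
    """Pair queries with responses regardless of their order in the file."""
    pending_queries: Dict[Key, Packet] = {}
    pending_responses: Dict[Key, Packet] = {}
    matched: List[Tuple[Packet, Packet]] = []

    for p in packets:
        key = match_key(p)
        if p.is_response:
            own, other = pending_responses, pending_queries
        else:
            own, other = pending_queries, pending_responses

        partner = other.get(key)
        if partner is not None:
            query, response = (partner, p) if p.is_response else (p, partner)
            # Decide by timestamps, not by position in the file.
            if 0.0 <= response.ts - query.ts <= max_delay:
                del other[key]
                matched.append((query, response))
                continue
        own[key] = p

    unmatched = list(pending_queries.values()) + list(pending_responses.values())
    return matched, unmatched


# The two packets from the capture above, in file order (response first).
# 192.0.2.1 and 198.51.100.1 are placeholder addresses.
response = Packet(1529308502.421889, "192.0.2.1", 33572, "198.51.100.1",
                  0x9172, "beardmens.cz.", True)
query = Packet(1529308502.421780, "192.0.2.1", 33572, "198.51.100.1",
               0x9172, "beardmens.cz.", False)

matched, unmatched = match_packets([response, query])
print(len(matched), len(unmatched))  # prints "1 0"
```

With this approach the file order no longer matters: feeding the packets as `[query, response]` gives the same single matched pair. A real implementation would also need to expire old entries from both pending tables.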