Query not matched with the answer (non-monotone PCAP)
It seems that dns-collector, when matching a DNS query with a DNS answer, does not rely on packet timestamps but rather on the order of packets in the PCAP file. This approach results in severe data inconsistencies in the Impala database (many NULL values).
The problem is described below using a query for the domain name beardmens.cz observed in the PCAP for the dns-s-02 server. Although the timestamps indicate that the query was observed first (before the answer), under certain circumstances the two packets may be stored in the PCAP file in reverse order:
$ tshark -t e -r dns-s-02.20180618.075500.008305 'dns.qry.name=="beardmens.cz"'
1568 1529308502.421889 194.0.13.1 → 138.68.242.82 DNS 774 Standard query response 0x9172 No such name NS beardmens.cz NSEC3 RRSIG NSEC3 RRSIG NSEC3 RRSIG SOA a.ns.nic.cz RRSIG OPT
1645 1529308502.421780 138.68.242.82 → 194.0.13.1 DNS 69 Standard query 0x9172 NS beardmens.cz OPT
For such input, dns-collector produces two separate rows in the CSV file (an unmatched answer and an unmatched query), which is not correct:
$ grep beardmens.cz preproc/dns-s-02/dns-s-02.20180618.075500.008305.csv
1529308502.421889|||746||774|138.68.242.82|33572|194.0.13.1|53|17|4|64||37234|2|1|0|3|1|0||0||0||beardmens.cz.|0|1|8||||||\N|\N|\N|\N|\N|\N
1529308502.421780||41||69||138.68.242.82|33572|194.0.13.1|53|17|4|54|15390|37234|2|1|0||||0||0||1|beardmens.cz.||||0|4096|1|||\N|\N|\N|\N|\N|
For comparison, the pcap-to-parquet tool provided with the entrada system produces the proper output (the query is matched with the answer).
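
To illustrate the expected behaviour, here is a minimal Python sketch of order-independent matching. It is not dns-collector's code and makes no claim about its internals. The names Packet, match_key and match_packets are hypothetical, and the 5-second window is an assumed value. Both queries and answers are kept as pending, so an answer that appears earlier in the file can still be paired with its query, as long as the timestamps show that the query came first.

```python
from collections import namedtuple

# Hypothetical, simplified packet record; a real collector tracks more fields.
Packet = namedtuple("Packet", "ts src sport dst dport dns_id qname is_response")

def match_key(p):
    """Key shared by a query and its answer: client, server, DNS ID, qname."""
    if p.is_response:
        # An answer travels server -> client, so swap the endpoints.
        return (p.dst, p.dport, p.src, p.sport, p.dns_id, p.qname.lower())
    return (p.src, p.sport, p.dst, p.dport, p.dns_id, p.qname.lower())

def match_packets(packets, window=5.0):
    """Pair queries with answers regardless of their order in the input.

    window is an assumed maximum query-to-answer delay in seconds.
    """
    pending_queries, pending_answers = {}, {}
    pairs = []
    for p in packets:
        key = match_key(p)
        if p.is_response:
            mine, other = pending_answers, pending_queries
        else:
            mine, other = pending_queries, pending_answers
        candidate = other.get(key)
        if candidate is not None:
            query, answer = (candidate, p) if p.is_response else (p, candidate)
            # Decide by timestamps, not by position in the file.
            if 0 <= answer.ts - query.ts <= window:
                del other[key]
                pairs.append((query, answer))
                continue
        mine[key] = p
    unmatched = list(pending_queries.values()) + list(pending_answers.values())
    return pairs, unmatched

# The two packets from the report, in PCAP order (answer first, query second).
packets = [
    Packet(1529308502.421889, "194.0.13.1", 53, "138.68.242.82", 33572,
           0x9172, "beardmens.cz.", True),
    Packet(1529308502.421780, "138.68.242.82", 33572, "194.0.13.1", 53,
           0x9172, "beardmens.cz.", False),
]
pairs, unmatched = match_packets(packets)
```

With this input, the sketch yields one matched pair and no unmatched packets. A real implementation would also need to expire pending entries once they fall outside the window, which this sketch omits.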