|
|
|
|
|
## Language and libraries
|
|
|
|
|
|
|
|
|
Language standard is C99. The proposed libraries are:
|
|
|
* [Libtrace](http://research.wand.net.nz/software/libtrace.php) for packet capture, dissection and dumping.
|
|
|
|
|
|
|
|
|
|
|
|
* [libUCW](http://www.ucw.cz/libucw/) for configuration parsing, logging, mempools (in the future?) and some data structures (currently doubly linked lists). Replaceable but convenient.
|
|
|
* libLZ4, libgz, liblzma, ... for online (de)compression of input pcaps and output files.
|
|
|
* CBOR: [libcbor](http://libcbor.org/) or other implementations?
|
|
|
|
|
|
|
|
|
## Operation and main structures
|
|
|
|
|
|
### Struct collector
|
|
|
|
|
|
|
|
|
Main container for a collector instance (avoiding global variables).
|
|
|
|
|
|
#### Has
|
|
|
|
|
|
* Configuration structure (given / loaded before init) (incl. inputs and outputs)
|
|
|
* Current and previous timeframe
|
|
|
|
|
|
* Writer thread(s)
|
|
|
* Basic stats on program run (time, packets collected/dropped)
|
|
|
* Collector-global lock on the shared state
|
|
|
* Signal from the collector to the writers that more frames to write are available
|
|
|
* Signal from the writers to the collector that more queue space is available (in offline mode)
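A minimal sketch of how this container might look in C99; all type and field names here are illustrative assumptions based on the list above, not the actual implementation:

```c
#include <pthread.h>
#include <stdint.h>

typedef int64_t dns_us_time_t;          /* assumed: microseconds since epoch */

struct dns_collector {
    struct dns_config *conf;            /* given / loaded before init */
    struct dns_timeframe *tf_cur;       /* current timeframe */
    struct dns_timeframe *tf_prev;      /* previous timeframe */

    pthread_mutex_t lock;               /* collector-global lock on shared state */
    pthread_cond_t frames_available;    /* collector -> writers: frames to write */
    pthread_cond_t queue_has_space;     /* writers -> collector (offline mode) */

    uint64_t pkts_collected;            /* basic run stats */
    uint64_t pkts_dropped;
    dns_us_time_t start_time;
};
```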
|
|
|
|
|
|
#### Setup
|
|
|
|
|
|
Gets a configuration struct, initializes self, starts writer threads and opens a packet capture (file list or live, applying capture length, promiscuous settings and BPF filters).
|
|
|
|
|
|
#### Main thread operation
|
|
|
|
|
|
The main thread collects a packet from the input, parses its data (IP/UDP/DNS headers) and finds query matches. If the packet time is past the current timeframe, a frame rotation is performed (see below). When a packet is invalid (malformed, unsupported network feature, ...), it is dropped and optionally dumped via one of the outputs (TODO: redesign dumping to use a separate thread).
|
|
|
|
|
|
|
|
|
#### Frame rotation
|
|
|
The timeframes are 0.1-10 sec long time windows (configurable). Any response packet is matched to a request packet in the current or the previous timeframe, so a response delayed by up to the frame length is always matched. When a packet beyond the current timeframe is read, the frames are rotated: the previous timeframe is enqueued for writeout at every output, the current timeframe becomes the previous one, and a new timeframe is created.
|
|
|
|
|
|
|
|
|
If a packet arrives out of order (with time smaller than the previous packet, as in wrong ordering of PCAP files), a warning is issued (TODO?) and it is processed as if it had the time of the last in-order packet.
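A rough sketch of the rotation step as described above; all function and field names are hypothetical and locking/refcounting is elided:

```c
/* Rotate: the previous frame goes to the output queues, the current
 * frame becomes the previous one, and a fresh frame starts. */
static void collector_rotate_frames(struct dns_collector *col, dns_us_time_t now)
{
    if (col->tf_prev)
        collector_enqueue_frame(col, col->tf_prev);   /* enqueue at every output */
    col->tf_prev = col->tf_cur;
    col->tf_cur = timeframe_create(col, now);
}

/* Called per packet: rotate until the packet falls into the current frame. */
static void collector_advance_time(struct dns_collector *col, dns_us_time_t pkt_time)
{
    while (pkt_time >= col->tf_cur->time_start + col->conf->frame_length_us)
        collector_rotate_frames(col,
                col->tf_cur->time_start + col->conf->frame_length_us);
}
```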
|
|
|
|
|
|
#### Writer thread
|
|
|
|
|
|
One or more writer threads with private timeframe queues. The queue has a limited length (configurable) to limit memory usage. When a queue is full, the main thread waits (offline mode) or drops the oldest not-currently-processed frame in the output queue. This way, there are at most `(max. length of queue) + (number of outputs) + 2` live timeframes (the 2 are in the main thread). The timeframes have to be processed in the order of creation (two threads cannot easily cooperate on a single output).
|
|
|
|
|
|
|
|
|
If the new timeframe is beyond the current period of an output file, this output is rotated as well before writing the frame (see below).
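A sketch of one writer thread's loop under this scheme, building on the hypothetical fields sketched earlier; `output_dequeue()`, `output_check_rotation()`, `output_write_frame()` and `timeframe_decref()` are assumed helpers implied by the design, not actual API:

```c
static void *dns_output_thread_main(void *arg)
{
    struct dns_output *out = arg;
    struct dns_collector *col = out->col;

    for (;;) {
        pthread_mutex_lock(&col->lock);
        while (out->queue_len == 0 && !col->finishing)
            pthread_cond_wait(&col->frames_available, &col->lock);
        if (out->queue_len == 0) {                   /* finishing and drained */
            pthread_mutex_unlock(&col->lock);
            break;
        }
        struct dns_timeframe *tf = output_dequeue(out);  /* oldest frame first */
        pthread_mutex_unlock(&col->lock);

        output_check_rotation(out, tf->time_start);  /* new file period? see below */
        output_write_frame(out, tf);                 /* write all packets of the frame */

        pthread_mutex_lock(&col->lock);
        timeframe_decref(tf);                        /* freed once all outputs are done */
        pthread_cond_signal(&col->queue_has_space);  /* offline mode: unblock the reader */
        pthread_mutex_unlock(&col->lock);
    }
    return NULL;
}
```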
|
|
|
|
|
|
#### Current state
|
|
|
|
|
|
* Mostly as above. Multiple outputs of the same type can be open, even with different rotation opts. Common code base for output rotation, file naming, compression.
|
|
|
* Per-output stats are not designed well.
|
|
|
* Pcap dump output needs redesign.
|
|
|
|
|
|
### Struct config
|
|
|
|
|
|
|
|
|
Holds collector configuration and configured inputs and outputs. Every input and output gets its own `dns_input` or `dns_output` struct, respectively.
|
|
|
Configured via [libucw configuration system](http://www.ucw.cz/libucw/doc/ucw/conf.html).
|
|
|
|
|
|
### Struct timeframe
|
|
|
|
|
|
|
|
|
Structure for queries within a time window (cca 1-10 sec, configurable). Contains all requests within that window, their matching responses within that or the next timeframe, and responses within this timeframe without a matching request.
|
|
|
|
|
|
|
|
|
Each timeframe is either in the main thread (current or previous frame) or in some of the output queues. The timeframes in the queues are refcounted to allow flexible timeframe dropping. The refcounts are under the global lock.
|
|
|
|
|
|
Shared state (with locks) should be accessed only a few times per timeframe, not per packet.
|
|
|
The separate queues allow e.g. dropping of timeframes on slow outputs (under heavy loads) while allowing fast outputs (e.g. counting stats) to see all the timeframes.
|
|
|
|
|
|
#### Has
|
|
|
* List of packets to write - possibly with rate-limiting per timeframe (linked list).
|
|
|
* List of dropped packets to dump - likely with rate-limiting per timeframe (linked list).
|
|
|
|
|
|
|
|
|
* Hash containing unmatched requests (by IPver, TCP/UDP, client/server port numbers, client/server IPs, DNS ID and QNAME) (can be freed on insertion to output queues).
|
|
|
* Refcount (number of queues containing it)
|
|
|
* Possibly(TODO?): a memory pool for all the packet data
|
|
|
|
|
|
#### Query hash
|
|
|
|
|
|
The hash is a fixed-size table of linked packet lists with configurable order (`size = 1 << order`). **Rationale:** rehashing could cause a lot of latency in the main thread. Having a reasonable limit on timeframe size to limit memory usage also limits the required hash size. (Even allowing 200MB of memory per timeframe means ~1M packets; a comfortable 2M-element table (order 21) takes 16MB.) A big enough hash for the upper limit of packets in the timeframe (hard limit or just estimated) takes about 8% of the memory of the packets, so a big enough table can be easily afforded within the expected memory usage. Each bucket is a linked list of packets (with the "next" pointer within the packet struct).
|
|
|
|
|
|
|
|
|
Currently matching pairs on `(IPver, transport, client IP, client port, server IP, server port, DNS ID, QNAME, QType, QClass)`; see the sketch after the notes below.
|
|
|
|
|
|
* QNAME, QType, QClass might not be present in all responses (e.g. NOT_IMPL)
|
|
|
* server IP/port is redundant for ICANN, but might be useful for client watching (or upstream from DNS cache, ... ?)
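To make the match key concrete, here is a sketch of one possible key layout and fixed-size table lookup. FNV-1a is an arbitrary choice for illustration; the actual hash function and field names are not specified by this design:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of a match key; zero-initialise the struct so that padding
 * bytes hash deterministically. Field names are hypothetical. */
struct dns_hash_key {
    uint8_t  ip_ver;                 /* 4 or 6 */
    uint8_t  transport;              /* UDP or TCP */
    uint8_t  client_addr[16];        /* IPv4 uses the first 4 bytes */
    uint8_t  server_addr[16];
    uint16_t client_port, server_port;
    uint16_t dns_id, qtype, qclass;
    uint16_t qname_len;
    const uint8_t *qname;            /* raw wire-format QNAME */
};

static uint32_t fnv1a(uint32_t h, const void *data, size_t len)
{
    const uint8_t *p = data;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

static uint32_t hash_index(const struct dns_hash_key *k, unsigned order)
{
    uint32_t h = fnv1a(2166136261u, k, offsetof(struct dns_hash_key, qname));
    h = fnv1a(h, k->qname, k->qname_len);
    return h & ((1u << order) - 1);   /* table size = 1 << order */
}
```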
|
|
|
|
|
|
#### Limiting memory use
|
|
|
|
|
|
The number of requests (and unmatched responses) in the frame should be bounded by a configurable constant. This should be a soft limit (e.g. packets should be dropped more and more frequently when approaching the limit); a sketch follows below. When a request is accepted, its response should always be accepted.
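One possible shape of such a soft limit; the 80% threshold and the linear ramp are assumptions made purely for illustration:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Sketch: no drops below 80% of the limit, then drop probability
 * rising linearly to 1.0 at the hard limit. */
static bool frame_soft_limit_drop(uint32_t in_frame, uint32_t limit)
{
    uint32_t soft = limit - limit / 5;            /* 80% of the limit */
    if (in_frame < soft)
        return false;
    if (in_frame >= limit)
        return true;
    return (uint32_t)rand() % (limit - soft) < (in_frame - soft);
}
```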
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
**Estimate:** with limit 1Mq per frame, cca 200 B/q (in memory), 1 output and 5x 1s frames in the queue, 1.6GB of memory should suffice for all the packets.
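For reference, the arithmetic behind this estimate: up to 5 queued frames + 2 frames held by the main thread + 1 frame being written by the output gives 8 live frames, and 8 × 1 Mq × 200 B/q = 1.6 GB, matching the `(max. length of queue) + (number of outputs) + 2` bound above.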
|
|
|
|
|
|
**Rationale:** The packets in the timeframes take up most of the collector memory. Since the memory use of a single packet is bounded by the packet capture bound plus a fixed overhead, bounding the packet number per timeframe is an easy and deterministic way to control memory usage (together with the number of timeframes).
|
|
|
|
|
|
|
|
|
**Alternatives:** Total packet count (for the entire collector) could better accommodate short-time bursts (spanning, say, 1-2 timeframes), but keeping these numbers in sync between the threads adds complexity, and the behaviour is less predictable. Another alternative is considering the total memory usage of the program, but it is not clear how technically viable and reliable this would be (what to measure? would such memory usage shrink on `free()`?), and it might not be very predictable.
|
|
|
|
|
|
**Question:** What to do with the (not dropped) responses to interface-dropped requests?
|
|
|
|
|
|
### Struct packet
|
|
|
|
|
|
|
|
|
Holds data about a single parsed query packet. libtrace handles packet data management and dissection; only the extracted network info and raw DNS data are stored in the packet struct.
|
|
|
The DNS parsing is done by simple header and QNAME label reading, without compression support. The remaining parts of the DNS message (various RRs) are not parsed (until we figure out how they would be useful). The full request DNS data might be stored (as it may be useful later for replays), while only a part of the response (status code, metadata) might be stored.
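A sketch of the uncompressed label walk described here, as a hypothetical helper returning the QNAME wire length, or -1 on error:

```c
#include <stdint.h>
#include <sys/types.h>   /* ssize_t */

/* Walk an uncompressed QNAME starting at `p` with `avail` captured bytes.
 * Returns the wire length including the terminating zero label, or -1 on
 * a compression pointer, overlong name or truncated capture. */
static ssize_t dns_qname_wire_len(const uint8_t *p, size_t avail)
{
    size_t off = 0;
    while (off < avail) {
        uint8_t len = p[off];
        if (len == 0)
            return (ssize_t)(off + 1);    /* root label terminates the name */
        if (len & 0xC0)
            return -1;                    /* compression pointer: not handled */
        off += 1 + len;
        if (off >= 255)
            return -1;                    /* RFC 1035: names are at most 255 octets */
    }
    return -1;                            /* capture ended before the zero label */
}
```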
|
|
|
|
|
|
|
|
|
**Rationale:** The data in RRs can be quite big, and it is hard to know in advance what all we might want from them. The data from the DNS header + QNAME + QType + QClass + DNSSEC flags seem to carry enough information for statistics. The DNSSEC info will require some RR parsing (use ldns?). Replies should in principle be recomputable from requests. If it is necessary to store all the information, a full PCAP (in a separate process) could be more appropriate.
|
|
|
|
|
|
#### Has
|
|
|
|
|
|
* Raw packet data: timestamp, real length, capture length
|
|
|
* Addresses, ports, transport info
|
|
|
|
|
|
* Raw DNS data (full for requests, truncated for responses), partially parsed DNS header (QNAME position and length)
|
|
|
* A request may have a matching response packet; in that case the response is owned by the request
|
|
|
|
|
|
* Next packet in hash bucket, next packet in timeframe
|
|
|
|
|
|
#### Packet network features
|
|
|
|
|
|
Handles both IPv4 and IPv6, as well as UDP.
|
|
|
|
|
|
|
|
|
Does not currently handle packet defragmentation. This would be nontrivial to do right and to manage resources for, while fragmentation on the IP level is rare for DNS packets. All fragmented packets can be dumped for later analysis.
|
|
|
**Alternative solution:** Capturing via Linux RAW `socket()` gives us IP-defragmented packets.
|
|
|
|
|
|
|
|
|
TCP flow could and should be reconstructed, but it seems less of a priority. Previously, one-data-packet TCP streams (not counting SYN, ACK and FIN packets) were processed and longer streams dropped; this can be re-enabled relatively easily. Longer packets and long-open TCP connections seem to be uncommon, but might be important in the future.
|
|
|
|
|
|
### Stats
|
|
|
|
|
|
|
|
|
Very basic statistics for the collector (time, dropped/read packets, dropped frames), the timeframes (dropped/read packets) and the outputs (dropped/read packets, dropped timeframes, written items and bytes before/after compression). It is not yet clear what else to measure. Any DNS data statistics (DSC-like) should be handled by an output plugin. Currently partially implemented and to be redesigned.
|
|
|
|
|
|
### Outputs
|
|
|
|
|
|
|
|
|
Each output type extends a basic output structure. This basic structure contains the current open file and filename (or socket, etc.), time of opening, rotation period, compression settings, basic statistics (bytes written, frames dropped, ...) and hooks for packet writing, dumping, file closing and opening.
|
|
|
|
|
|
|
|
|
Each output type (currently CSV, defunct ProtoBuf, in the future CBOR) extends this type with additional struct fields and sets the hooks appropriately on config (configuration handled by libucw).
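A sketch of how the base struct and one extension might be laid out; the field names are assumptions based on the description above:

```c
#include <stdio.h>
#include <stdint.h>
#include <time.h>

struct dns_packet;   /* defined elsewhere */

/* Sketch of the common base: every output type embeds this first. */
struct dns_output {
    FILE *file;                       /* current open file (or socket, ...) */
    char *fname;                      /* current file name */
    time_t opened_at;                 /* time of opening */
    int rotation_period_sec;          /* 0 = no rotation */
    int compression;                  /* e.g. none / LZ4 */
    uint64_t bytes_written, frames_dropped;

    /* hooks set on config by the concrete output type */
    void (*open_file)(struct dns_output *o, time_t t);
    void (*close_file)(struct dns_output *o);
    void (*write_packet)(struct dns_output *o, struct dns_packet *pkt);
};

/* Sketch of one extension: CSV adds its own options after the base. */
struct dns_output_csv {
    struct dns_output base;           /* must come first */
    int header_line;                  /* optional header line */
    char separator;                   /* configurable separator, e.g. '|' */
    uint32_t field_mask;              /* configurable field set */
};
```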
|
|
|
|
|
|
|
|
|
The current output fields are:
|
|
|
|
|
|
```
flags(IPv4/6,TCP/UDP) client-addr client-port server-addr server-port id qname qtype qclass request-time-us request-flags request-ans-rrs request-auth-rrs request-add-rrs request-length response-time-us response-flags response-ans-rrs response-auth-rrs response-add-rrs response-length
```
|
|
|
|
|
|
|
|
|
Every output has a pathname template with `strftime()` replacement. An output can be compressed on the fly (which saves disk space and also write time). Fast compression (LZ4, ...) is preferred, currently LZ4 is implemented. A packet-block-based compression can be enabled.
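For example, a file name could be produced from the template at open time with plain `strftime()`; the template string here is invented:

```c
#include <stdio.h>
#include <time.h>

/* Expand an output pathname template such as
 * "/var/dns/queries-%Y%m%d-%H%M%S.csv.lz4" at file-open time. */
static void output_expand_fname(const char *template, char *buf,
                                size_t buflen, time_t t)
{
    struct tm tm;
    gmtime_r(&t, &tm);                 /* or localtime_r, per configuration */
    strftime(buf, buflen, template, &tm);
}
```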
|
|
|
|
|
|
#### Memory usage limits
|
|
|
The maximum length of the timeframe queue of every output should be bounded (and configurable). When exceeded, the oldest timeframe not currently being processed should be dropped. **Rationale:** Together with the timeframe size, this predictably limits total memory usage. Dropping data on lagging (e.g. IO-bound) outputs is preferable to dropping packets on input and therefore missing them on fast (e.g. counting) outputs. See the timeframe discussion.
|
|
|
|
|
|
#### Disk usage limits
|
|
|
|
|
|
Optional. When approaching a per-output-file size limit, softly introduce query skipping. Not implemented.
|
|
|
|
|
|
#### CSV output
|
|
|
Optional header line, configurable separator, configurable field set.
|
|
|
|
|
|
Most commonly accepted format. No quoting necessary with e.g. "|" delimiter.
|
|
|
Actually not much larger than Protocol Buffers when compressed (e.g. with just the very fast "lz4 -4": 33 B/query CSV, 29 B/query ProtoBuf). The most commonly accepted format. No quoting is necessary with e.g. "|" as the delimiter and "reasonable" QNAMEs. Currently, unprintable characters and the separator in the QNAME are replaced.
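An illustrative header line and record with the "|" separator. All values are invented; in particular, the encoding of the first flags field is not specified here, while the DNS flag values shown (256 = RD, 33152 = QR|RD|RA) follow the standard wire-format flag bits:

```
flags|client-addr|client-port|server-addr|server-port|id|qname|qtype|qclass|request-time-us|request-flags|request-ans-rrs|request-auth-rrs|request-add-rrs|request-length|response-time-us|response-flags|response-ans-rrs|response-auth-rrs|response-add-rrs|response-length
0|198.51.100.7|53211|192.0.2.1|53|4242|example.com.|1|1|1469000001000000|256|0|0|0|29|1469000001000350|33152|1|0|0|45
```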
|
|
|
|
|
|
#### Protocol Buffer output
|
|
|
Similar to CSV, configurable field set, one length-prefixed (16 bits) protobuf message per query.
|
|
|
Library `protobuf-c` seems to use reflection when serialising rather than fully generated code (as it does in C++), so the speed is not great (comparable to CSV?). Also, the length-prefixing is necessary and makes the output unreadable by standard Protobuf tools. The output code is currently not updated and defunct.
|
|
|
|
|
|
#### PCAP
|
|
|
Currently only used for dropped packets. Should be rate-limited (with softly increasing drop-rate).
|
|
|
|
|
|
|
|
|
|
|
|
## Inputs
|
|
|
|
|
|
|
|
|
The input is either a list of live interfaces, or a list of pcap files to be processed in the given order. When reading pcap files, the "current" time follows the recorded times. For online capture, multiple interfaces are supported, with configurable promiscuous mode and BPF filtering.
|
|
|
|
|
|
|
|
|
**Note:** Multiple reader threads would be harder to support, as the access to the query hash would have to be somehow guarded, or the hashes would have to be per-thread. Since the main congestion is expected to be at the outputs, this may not be a problem. If required in the future, this can be a (very) advanced feature.
|
|
|
|
|
|
|
|
|
[Libtrace](http://research.wand.net.nz/software/libtrace.php) is preferred to tcpdump's PCAP for the larger feature set, implemented header and layer skipping, larger set of inputs (including kernel ring buffers).
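A minimal sketch of opening one live capture with libtrace, roughly following the Setup description above; error handling is abbreviated and the helper itself is hypothetical:

```c
#include <stdbool.h>
#include <stdio.h>
#include <libtrace.h>

/* Open a live capture on `iface`, applying capture length, promiscuous
 * mode and an optional BPF filter. Returns NULL on failure. */
static libtrace_t *open_live_input(const char *iface, int snaplen,
                                   bool promisc, const char *bpf)
{
    char uri[128];
    snprintf(uri, sizeof uri, "int:%s", iface);   /* native Linux capture */

    libtrace_t *t = trace_create(uri);
    if (trace_is_err(t))
        return NULL;

    trace_config(t, TRACE_OPTION_SNAPLEN, &snaplen);
    int p = promisc;
    trace_config(t, TRACE_OPTION_PROMISC, &p);
    if (bpf) {
        libtrace_filter_t *f = trace_create_filter(bpf);
        trace_config(t, TRACE_OPTION_FILTER, f);
    }
    if (trace_start(t) < 0) {
        trace_destroy(t);
        return NULL;
    }
    return t;
}
```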
|
|
|
|
|
|
## Configuration / options
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Configuration is read by the [libucw configuration system](http://www.ucw.cz/libucw/doc/ucw/conf.html). Configuration should allow setting predictable limits on memory usage and potentially disk usage. CPU usage should be regulated by means of the OS (nice, cpulimit, cgroups).
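A sketch of how a few collector limits might be registered with the libucw config system; the section name, item names, types and defaults are all invented for illustration:

```c
#include <ucw/lib.h>
#include <ucw/conf.h>

static uns frame_length_ms = 1000;       /* hypothetical option: frame window */
static uns max_frame_queue = 5;          /* hypothetical option: queue length */
static uns max_frame_packets = 1000000;  /* hypothetical option: soft packet limit */

static struct cf_section collector_cf = {
    CF_ITEMS {
        CF_UNS("FrameLengthMs", &frame_length_ms),
        CF_UNS("MaxFrameQueue", &max_frame_queue),
        CF_UNS("MaxFramePackets", &max_frame_packets),
        CF_END
    }
};

/* At startup, before loading the config file: */
/* cf_declare_section("Collector", &collector_cf, 0); */
```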
|
|
|
|
|
|
|
|
|
Reloading config is supported only via program restart. Optionally, the program could wait until the outputs are rotated (or at least until timeframe rotation). Shortly after program start, unmatched responses should be ignored. The amount of missed packets should not be significant relative to the frequency of such changes.
|
|
|
|
|
|
Supporting online reconfiguration would significantly increase program complexity. What makes sense to be reconfigurable?
|
|
|
|
|
|
## Logging and reports
|
|
|
|
|
|
|
|
|
Currently using libucw logging system and configured via the same config file. Includes optional log file rotation. A sub-logger for potentially frequent messages with rate-limiting is also configured by default.
|
|
|
|
|
|
|
|
|
Input and output statistics should be logged (e.g. on output file rotation). Statistical outputs might include some statistics. No other reporting mechanism is currently designed.
|
|
|
|
|
|
|
|
|
## Questions
|
|
|
* Runtime control and reconfiguration - how much control is desired and useful? How to implement it? Currently: No runtime control.
|
|
|
|
|
|
* Which output modules to support? CSV, Protobuf, counting stats (DSC-like?), CBOR, ...
|
|
|
|
|
# Older (process again):
|
|
|
|
|
|
## TCP/IP status:
|
|
|
|
|
|
* Accepts both IPv4 and IPv6
|
|
|
* No IP fragment reconstruction
|
|
|
* Not planned (rather technical, separate for IPv4 and IPv6, ...)
|
|
|
* Should not happen too much anyway (very few requests have >100 bytes, very few responses have >1000 bytes)
|
|
|
* TCP is currently dropped ~~limited to (single request, single response) streams, TCP options accepted but skipped~~
|
|
|
* These short TCP connections seem to be (almost?) all the cases in "akuma" data
|
|
|
* PLANNED: TCP flow reconstruction, keeping open connections (currently ignores SYN/ACK/FIN)
|
|
|
* UDP fully supported
|
|
|
* Dropping all packets with data size mismatches etc. Optional PCAP dump of such packets (currently deactivated, TODO)
|
|
|
|
|
|
## DNS status
|
|
|
|
|
|
* Dropping packets with QNAME length above 254 (by RFC)
|
|
|
* Dropping packets with the snapshot (captured part) ending before the entire DNS QNAME part (should not happen with reasonable snaplen)
|
|
|
* Matching pairs on (IPver, transport, client IP, client port, server IP, server port, DNS ID, QNAME, QType, QClass)
|
|
|
* QNAME, QType, QClass might not be present in all responses (e.g. NOT_IMPL)
|
|
|
* server IP/port is redundant for ICANN, but might be useful for client watching (or upstream from DNS cache, ... ?)
|
|
|
|
|
|
## Output
|
|
|
|
|
|
* Modular, currently CSV and obsolete (but close to working) Protobuf
|
|
|
* Optional in-line compression (currently lz4), stored-field selection and time-based output file rotation (based on filename format string)
|
|
|
* Separate threads for:
|
|
|
* main thread - packet collection, parsing and matching responses with requests
|
|
|
* thread for every configured output
|
|
|
* multiple completely independent instances of the same output type can be configured
|
|
|
|
|
|
### Dumping dropped packets
|
|
|
* Configurable dump/drop by category