|
|
# Collector notes
|
|
|
|
|
|
* C99 with [libUCW](http://www.ucw.cz/libucw/doc/ucw/) and protobuf-c
|
|
|
* PLANNED: Time frames (approx. 1-300 s), soft rate-limiting
|
|
|
|
|
|
## Input
|
|
|
* Currently only PCAP
|
|
|
* Looking at libtrace and SOCK_RAW sockets
|
|
|
* Supports truncated packets (length checks in all the code)
|
|
|
|
|
|
## TCP/IP status and assumptions
|
|
|
* Accepts both IPv4 and IPv6
|
|
|
* Currently drops IPv6 with extra headers (TODO: skip them, detect fragmentation headers) (none encountered in `akuma` data)
|
|
|
* No IP fragment reconstruction
|
|
|
* Not planned (rather technical, separate for IPv4 and IPv6, ...)
|
|
|
* Opening a SOCK_RAW socket handles IP reconstruction in the kernel
|
|
|
* Should not happen too much anyway (very few requests have >100 bytes, very few responses have >1000 bytes)
|
|
|
* TCP is limited to (single request, single response) streams, TCP options accepted but ignored
|
|
|
* These short TCP connections seem to be (almost?) all of the cases in the `akuma` data
|
|
|
* Find out: how many long TCP conns are there?
|
|
|
* PLANNED: TCP flow reconstruction, keeping open connections (currently ignores SYN/ACK/FIN)
|
|
|
* UDP fully supported
|
|
|
* Dropping all packets with data size mismatches etc.
|
|
|
|
|
|
## DNS status
|
|
|
* Dropping packets with `OPCODE != QUERY`
|
|
|
* Store some other opcode? (IQUERY is obsolete, STATUS?)
|
|
|
* Dropping packets with QNAME length above 254 (by RFC)
|
|
|
* Only accepting packets with exactly 1 QNAME
|
|
|
* Dropping packets with "compressed" QNAME, see [RFC section](https://tools.ietf.org/html/rfc1035#section-4.1.4)
|
|
|
* Find out: are those still used?
|
|
|
* Dropping packets with the snapshot (captured part) ending before the entire DNS QNAME part (should not happen with reasonable snaplen)
|
|
|
* TODO NEXT: Actually match the queries and responses
|
|
|
|
|
|
## Output
|
|
|
|
|
|
* Modular, not dependent on protobufs (Can include CBOR or other if needed.)
|
|
|
* PLANNED: separate threads for:
|
|
|
* 1x packet collection, parsing, dumping and matching responses with requests (hash table)
|
|
|
* (1+)x time frame serialization and writing (file, socket or database)
|
|
|
|
|
|
### Protobuf
|
|
|
* Implemented a message for request+response pair writing (`dnsquery.proto`)
|
|
|
* PLANNED: Configurable which attributes are included
|
|
|
|
|
|
### Dumping dropped packets
|
|
|
* Configurable dump/drop by category
|
|
|
* PLANNED: Rotate pcap files with time frames
|
|
|
* PLANNED: Soft rate-limiting to prevent choking

#### DNS collector design draft
|
|
|
|
|
|
* Author: Tomáš Gavenčiak, tomas.gavenciak@nic.cz
|
|
|
* Date: 1st Mar 2016
|
|
|
|
|
|
### Operation and main structures
|
|
|
|
|
|
## Struct collector
|
|
|
|
|
|
Main container for a collector instance (try to avoid global state).
|
|
|
|
|
|
# Has
|
|
|
* Configuration structure (given / loaded before init) (incl. outputs)
|
|
|
* Current and previous timeframe
|
|
|
* Queue of timeframes to write (thread safe) and writer thread(s)
|
|
|
* Basic stats on program run (time, packets collected/dropped)
|
|
|
|
|
|
# Setup
|
|
|
Gets a configuration struct, initializes itself and opens a packet capture
|
|
|
(file list or live, applying capture length, promiscuous settings and BPF filters).
|
|
|
|
|
|
# Main thread operation
|
|
|
Main thread collects a packet from the input and parses its data (IP/UDP/DNS headers).
|
|
|
If the packet's timestamp is past the current timeframe, a frame rotation is performed (see below).
|
|
|
When a packet is invalid (malformed, unsupported network feature, ...), it is dropped and optionally dumped via one of the outputs.
|
|
|
|
|
|
# Frame rotation
|
|
|
The timeframes are approx. 0.1-10 s long time windows (configurable). Any response packet is matched to a
|
|
|
request packet in the current or the previous timeframe (so a response delayed up to the
|
|
|
frame length is always matched). When a packet beyond the current timeframe is read, the
|
|
|
frames are rotated: The previous timeframe is enqueued for writeout, the current timeframe
|
|
|
becomes the previous one and a new timeframe is created.
|
|
|
|
|
|
If the new timeframe is beyond the current period of an output file, this output is rotated as well before
|
|
|
writing the frame (see below).
|
|
|
|
|
|
If a packet arrives out of order (with time smaller than the previous packet, as in wrong ordering of PCAP files),
|
|
|
a warning is issued and it is processed as if it had the time of the last in-order packet.
|
|
|
|
|
|
# Writer thread
|
|
|
One or more writer threads picking up timeframes from the queue and writing their packets to the outputs.
|
|
|
Destroy the packets and timeframes afterwards. If a timeframe is the last one to use an output file, that file
|
|
|
is closed.
|
|
|
|
|
|
The timeframes have to be processed in the order of creation.
|
|
|
|
|
|
# Current state
|
|
|
* The writeout is done in the same thread.
|
|
|
* Only one output file per configured output is open.
|
|
|
* Stats to keep track of are not finalised.
|
|
|
|
|
|
## Struct config
|
|
|
|
|
|
Holds collector configuration and configured inputs and outputs.
|
|
|
Configured via [libucw configuration system](http://www.ucw.cz/libucw/doc/ucw/conf.html).
|
|
|
|
|
|
## Struct timeframe
|
|
|
|
|
|
Structure for queries within a time window (approx. 1-10 s, configurable). Contains all requests within
|
|
|
that window, their matching responses within that or the next timeframe, and responses within this
|
|
|
timeframe without a matching request. This limits the writer threads to one (simpler situation and code
|
|
|
structure) or to one per configured output (giving each output a timeframe queue with read-only
|
|
|
timeframes, destroying each frame once it has been processed by all outputs, which may require refcounting).
|
|
|
|
|
|
The preferred direction seems to be one thread per output, separating their different runtime requirements.
|
|
|
This way it may be more natural to drop timeframes only for "slow" outputs (e.g. PCAP) when their queue gets too long,
|
|
|
and not for "fast" ones (e.g. counting-only statistics).
|
|
|
|
|
|
Shared state (with locks) should be accessed only a few times per timeframe, not per packet.
|
|
|
|
|
|
# Has
|
|
|
* List of packets to write - possibly with rate-limiting per timeframe (linked list).
|
|
|
* List of dropped packets to dump - likely with rate-limiting per timeframe (linked list).
|
|
|
* Hash containing unmatched requests (by IPver, TCP/UDP, client/server port numbers, client/server IPs, DNS ID and QNAME)
|
|
|
* Possibly: a memory pool for all the packet data
|
|
|
|
|
|
# Query hash
|
|
|
The hash is a fixed-size table of configurable order. Rationale: rehashing could cause a lot of latency in the main thread.
|
|
|
A big enough hash for the upper limit of packets in the timeframe (hard limit or just estimated) takes about 3% of the memory used by the packets,
|
|
|
so a big enough table can be easily afforded within the expected memory usage.
|
|
|
|
|
|
The hash is a linked list of packets in each bucket (with the "next" ptr within the packet struct).
|
|
|
|
|
|
# Limiting memory use
|
|
|
The number of requests (and unmatched responses) in the frame should be bounded by a configurable constant.
|
|
|
This should be a soft limit (e.g. packets should be dropped more frequently when approaching the limit).
|
|
|
When a request is accepted, its response should always be accepted.
|
|
|
|
|
|
**Question:** What to do with the (not dropped) responses to dropped requests?
|
|
|
|
|
|
**Rationale:** The packets in the timeframes take up most of collector memory. Since the memory use of a single packet
|
|
|
is bounded by the packet capture bound plus a fixed overhead, bounding the packet number per timeframe is an easy and
|
|
|
deterministic way to control memory usage (together with the number of timeframes).
|
|
|
|
|
|
**Alternatives:** Total packet count could better accommodate short-time bursts (spanning, say, 1-2 timeframes), but
|
|
|
keeping these numbers in sync between the threads adds complexity. Also, this behaviour is less predictable.
|
|
|
Another alternative is considering the total memory usage of the program. Not sure how technically viable and
|
|
|
reliable (what to measure? would such memory usage shrink on `free()`?), and might not be very predictable.
|
|
|
|
|
|
## Struct packet
|
|
|
|
|
|
Holds data about a single query packet. Uses libtrace to handle packet data management and dissection.
|
|
|
The DNS parsing is done by a simple header and QNAME label reading without compression. The remaining
|
|
|
parts of the DNS message (various RRs) are not parsed (until we figure out how would they be useful).
|
|
|
|
|
|
**Rationale:** The data in RRs can be quite large, and it is hard to know in advance which parts we might want.
|
|
|
The data from DNS header + QNAME + QType + QClass seem to carry enough information for statistics.
|
|
|
Replies should in principle be recomputable from requests.
|
|
|
If it is necessary to store all the information, a full PCAP (in a separate process)
|
|
|
could be more appropriate.
|
|
|
|
|
|
# Has
|
|
|
* Raw packet data: timestamp, real length, capture length, packet data
|
|
|
* Addresses, ports, transport info
|
|
|
* DNS header data, qname as a printable string (dot notation)
|
|
|
* Request may have a matching response packet. In this case the response is owned by the request
|
|
|
* (Next packet in hash bucket, next packet in timeframe)
|
|
|
|
|
|
# Packet network features
|
|
|
Handles both IPv4 and IPv6, as well as UDP.
|
|
|
|
|
|
Does not currently handle packet defragmentation. This would be nontrivial to do right and to manage resources for, while fragmentation on the IP level is rare for DNS packets. Fragmented packets can all be dumped for later analysis.
|
|
|
**Alternative solution:** Capturing via Linux RAW `socket()` gives us IP-defragmented packets.
|
|
|
|
|
|
TCP flow could be reconstructed, but it seems less of a priority. Currently only one-data-packet TCP streams
|
|
|
(not counting SYN, ACK and FIN packets) are processed, longer streams are dropped. Longer packets and
|
|
|
long-open TCP connections seem to be uncommon.
|
|
|
|
|
|
## Stats
|
|
|
|
|
|
Very basic statistics for the collector (time, dropped/read packets, dropped frames), the timeframes (dropped/read packets),
|
|
|
the outputs (dropped/read packets, dropped timeframes, written items and bytes before/after compression).
|
|
|
It is not yet clear what exactly to measure. Any DNS data statistics should be handled by an output plugin.
|
|
|
|
|
|
Currently partially implemented.
|
|
|
|
|
|
## Outputs
|
|
|
|
|
|
Each output type extends a basic output structure. This basic structure contains the current open file and filename
|
|
|
(or socket, etc.), time of opening, rotation period, compression settings, basic statistics (bytes written, frames dropped, ...)
|
|
|
and hooks for packet writing, dumping, file closing and opening.
|
|
|
|
|
|
Each output type (currently CSV, ProtoBuf, PCAP) extends this type with additional struct fields and sets the hooks
|
|
|
appropriately on config (configuration handled by libucw). The current fields are:
|
|
|
|
|
|
```
flags(IPv4/6,TCP/UDP) client-addr client-port server-addr server-port id qname qtype qclass
request-time-us request-flags request-ans-rrs request-auth-rrs request-add-rrs request-length
response-time-us response-flags response-ans-rrs response-auth-rrs response-add-rrs response-length
```
|
|
|
|
|
|
|
|
|
Every output has a pathname template with strftime() replacement. An output can be compressed on the fly (which saves
|
|
|
disk space and also write time). Fast compression (LZ4, ...) is preferred.
|
|
|
|
|
|
# Memory usage limits
|
|
|
The maximum length of the timeframe queue of every output should be bounded (and configurable).
|
|
|
When exceeded, the oldest timeframe not currently being processed should be dropped.
|
|
|
Rationale: Together with timeframe size this predictably limits
|
|
|
total memory usage. Dropping data on lagging (e.g. IO-bound) outputs is preferable to dropping packets on input
|
|
|
and therefore missing them on fast (e.g. counting) outputs.
|
|
|
|
|
|
# Disk usage limits
|
|
|
Optional. When approaching a per-output-file size limit, softly introduce query skipping.
|
|
|
|
|
|
# CSV output
|
|
|
Optional header line, configurable separator, configurable field set.
|
|
|
Actually not much larger than Protocol Buffers when compressed (e.g. with just the very fast "lz4 -4": 33 B/query CSV, 29 B/query ProtoBuf).
|
|
|
Most commonly accepted format. No quoting necessary with e.g. "|" delimiter.
|
|
|
|
|
|
# Protocol Buffer output
|
|
|
Similar to CSV, configurable field set, one length-prefixed (16 bits) protobuf message per query.
|
|
|
The `protobuf-c` library seems to use reflection when serialising rather than fully generated code (as protobuf does in C++),
|
|
|
so the speed is not great (comparable to CSV?).
|
|
|
|
|
|
# PCAP
|
|
|
Currently only used for dropped packets. Should be rate-limited (with softly increasing drop-rate).
|
|
|
|
|
|
# Current state
|
|
|
Timeframes ready for output are processed immediately in the main thread (no output queue, no rate limiting).
|
|
|
|
|
|
## Inputs
|
|
|
|
|
|
The input is either a single interface, or a list of pcap files to be processed in the given order.
|
|
|
When reading pcap files, the "current" time follows the recorded times.
|
|
|
|
|
|
Multiple specified input interfaces (and not just "all") would require multiple PCAPs (or traces) open, but libtrace
|
|
|
does not seem to support polling on multiple traces. Advanced setups can be obtained by listening to "all" interfaces
|
|
|
with kernel BPF filter.
|
|
|
|
|
|
Multiple reader threads are hard to support, as the access to the query hash would have to be somehow guarded.
|
|
|
Since the main congestion is expected to be at the outputs, this may not be a problem. If required in the future,
|
|
|
can be a (very) advanced feature.
|
|
|
|
|
|
[Libtrace](http://research.wand.net.nz/software/libtrace.php) is preferred to tcpdump's libpcap for its larger
|
|
|
feature set, built-in header and layer skipping, and a larger set of inputs (including kernel ring buffers).
|
|
|
|
|
|
# Current state
|
|
|
libpcap is used to read pcap files; live capture could be implemented easily, but a switch to libtrace is expected (and would be as easy to implement there, with additional benefits when parsing the layers).
|
|
|
|
|
|
### Configuration / options
|
|
|
|
|
|
Configuration is read by the [libucw configuration system](http://www.ucw.cz/libucw/doc/ucw/conf.html).
|
|
|
Configuration should allow setting predictable limits on memory usage and potentially disk usage.
|
|
|
CPU usage should be regulated by means of the OS (nice, cpulimit, cgroups).
|
|
|
|
|
|
Reloading is supported only via program restart. Optionally, the program could wait until the outputs are rotated
|
|
|
(or at least until timeframe rotation). Shortly after program start, unmatched responses should be ignored.
|
|
|
The amount of missed packets should not be significant relative to the frequency of such changes.
|
|
|
|
|
|
Supporting online reconfiguration would greatly increase program complexity and could introduce bugs and memory leaks.
|
|
|
A potential exception could be the BPF filter string. What would be good use-cases or easily tunable parameters?
|
|
|
|
|
|
### Language and libraries
|
|
|
|
|
|
The language standard is C99. The proposed libraries are:
|
|
|
* [Libtrace](http://research.wand.net.nz/software/libtrace.php) for packet capture, dissection and dumping.
|
|
|
* [libUCW](http://www.ucw.cz/libucw/) for configuration parsing, logging, mempools (in the future?) and some data structures (currently doubly linked lists). Replaceable but convenient.
|
|
|
* libLZ4, libgz, ... for online (de)compression of input pcaps and output files. Partially implemented separately, but also part of libtrace.
|
|
|
* protobuf-c for writing protocol buffers.
|
|
|
|
|
|
### Logging and reports
|
|
|
|
|
|
Currently using libucw logging system and configured via the same config file. Includes optional log file rotation.
|
|
|
A sub-logger for potentially frequent messages with rate-limiting is also configured by default.
|
|
|
|
|
|
Input and output statistics should be logged (e.g. on output file rotation).
|
|
|
Statistics outputs might include further summaries. No other reporting mechanism is currently designed.
|
|
|
|
|
|
### Questions
|
|
|
|
|
|
* Libtrace vs libPCAP.
|
|
|
Currently: libPCAP. Tomas: in favor of libtrace.
|
|
|
|
|
|
* One thread per output vs one writer thread.
|
|
|
Currently: No threads (WIP). Tomas: in favor of one thread per output.
|
|
|
|
|
|
* Runtime control and reconfiguration - how much control is desired and useful? How to implement it?
|
|
|
Currently: No runtime control.
|
|
|
|
|
|
* Which output modules to support? CSV, Protobuf, counting stats (DSC-like?), CBOR, ...
|
|
|
|
|
|
|
|
|
|