Collector design draft and notes
- Author: Tomáš Gavenčiak, tomas.gavenciak@nic.cz
- Date: 22 Mar 2016
Language and libraries
Language standard is C99. The proposed libraries are:
- Libtrace for packet capture, dissection and dumping.
- libUCW for configuration parsing, logging, mempools (in the future?) and some data structures (currently doubly linked lists). Replaceable but convenient.
- libLZ4, libgz, liblzma, ... for online (de)compression of input pcaps and output files.
- CBOR: libcbor or other implementations?
- LDNS or libKnot for DNS RR parsing, or should we implement just the DNSSEC/EDNS parsing of requests ourselves?
Operation and main structures
Struct collector
Main container for a collector instance (avoiding global variables); a field sketch follows the list below.
Has
- Configuration structure (given / loaded before init) (incl. inputs and outputs)
- Current and previous timeframe
- Writer thread(s)
- Basic stats on program run (time, packets collected/dropped)
- Collector-global lock on the shared state
- Signal from the collector to the writers that more frames to write are available
- Signal from the writers to the collector that more queue space is available (in offline mode)
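For illustration, a minimal sketch of such a structure; all type and field names here are placeholders, not the actual source:

```c
/* Illustrative sketch only; all type and field names are placeholders. */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

struct dns_config;      /* configuration, incl. inputs and outputs */
struct dns_timeframe;   /* see "Struct timeframe" below */

struct dns_collector {
    struct dns_config *config;        /* given / loaded before init */

    struct dns_timeframe *tf_cur;     /* timeframe currently being filled */
    struct dns_timeframe *tf_prev;    /* previous timeframe, kept for late responses */

    pthread_t *writer_threads;        /* one or more writer threads */
    size_t writer_count;

    /* basic stats on the program run */
    uint64_t packets_collected;
    uint64_t packets_dropped;

    /* shared-state synchronisation */
    pthread_mutex_t lock;             /* collector-global lock */
    pthread_cond_t frames_available;  /* collector -> writers: more frames to write */
    pthread_cond_t queue_space;       /* writers -> collector: queue space freed (offline mode) */
};
```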
Setup
Gets a configuration struct, initializes self, starts writer threads and opens a packet capture (file list or live, applying capture length, promiscuous settings and BPF filters).
Main thread operation
The main thread collects a packet from the input, parses its data (IP/UDP/DNS headers) and finds query matches. If the packet time is past the current timeframe, a frame rotation is performed (see below). When a packet is invalid (malformed, unsupported network feature, ...), it is dropped and optionally dumped via some of the outputs (TODO: redesign dumping to use a separate thread).
The shared state, locks and signals are accessed only a bounded number of times (roughly 10-20) per timeframe (not per packet!), so the overhead is very small.
Frame rotation
The timeframes are 0.1-10 sec long time windows (configurable). Any response packet is matched to a request packet in the current or the previous timeframe (so a response delayed up to the frame length is always matched). When a packet beyond the current timeframe is read, the frames are rotated: The previous timeframe is enqueued for writeout at every output, the current timeframe becomes the previous one and a new timeframe is created.
If a packet arrives out of order (with time smaller than the previous packet, as in wrong ordering of PCAP files), a warning is issued (TODO?) and it is processed as if it had the time of the last in-order packet.
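A minimal sketch of the rotation step, assuming hypothetical helpers (output_enqueue_frame(), timeframe_create()) and placeholder fields; the outputs array is assumed to be reachable from the collector (e.g. via its config), and output_enqueue_frame() is assumed to take a reference on the frame for its queue:

```c
/* Sketch of frame rotation; helper names and struct fields are placeholders. */
static void collector_rotate_frames(struct dns_collector *col, uint64_t new_frame_start_us)
{
    pthread_mutex_lock(&col->lock);

    /* hand the previous timeframe to every output queue for writeout */
    for (size_t i = 0; i < col->output_count; i++)
        output_enqueue_frame(col->outputs[i], col->tf_prev);

    pthread_cond_broadcast(&col->frames_available);   /* wake the writer threads */
    pthread_mutex_unlock(&col->lock);

    /* the current frame becomes the previous one; open a fresh current frame */
    col->tf_prev = col->tf_cur;
    col->tf_cur = timeframe_create(new_frame_start_us);
}
```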
Writer thread
One or more writer threads with private timeframe queues. The queue has a limited length (configurable) to limit memory usage. When the queue is full, the main thread either waits (offline mode) or drops the oldest not-currently-processed frame in the output queue. This way, there are at most (max. queue length) + (number of outputs) + 2 live timeframes (the 2 are in the main thread). The timeframes have to be processed in the order of creation (two threads cannot easily cooperate on a single output).
If the new timeframe is beyond the current period of an output file, this output is rotated as well before writing the frame (see below).
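A rough sketch of one possible writer loop under these assumptions; all struct fields and helper functions are placeholders, and shutdown handling is omitted:

```c
/* Sketch of a writer thread main loop. Struct fields and helpers
 * (queue_*, output_*, timeframe_decref) are placeholders for illustration. */
#include <pthread.h>

static void *writer_thread_main(void *arg)
{
    struct dns_output *out = arg;                 /* hypothetical per-output state */
    struct dns_collector *col = out->collector;

    for (;;) {                                    /* shutdown handling omitted */
        pthread_mutex_lock(&col->lock);
        while (queue_is_empty(&out->queue))
            pthread_cond_wait(&col->frames_available, &col->lock);
        struct dns_timeframe *tf = queue_pop_oldest(&out->queue);
        pthread_cond_signal(&col->queue_space);   /* offline mode: main thread may resume */
        pthread_mutex_unlock(&col->lock);

        output_check_rotation(out, tf);           /* rotate the output file if the frame is
                                                     past the current output period */
        output_write_frame(out, tf);              /* serialise all queries in the frame */
        timeframe_decref(tf);                     /* release this queue's reference */
    }
    return NULL;
}
```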
Current state
- Mostly as above. Multiple outputs of the same type can be open, even with different rotation opts. Common code base for output rotation, file naming, compression.
- Per-output stats are not designed well.
- Pcap dump output needs redesign.
Struct config
Holds the collector configuration and the configured inputs and outputs. Every input and output gets its own dns_input and dns_output struct.
Configured via libucw configuration system.
Struct timeframe
Structure for queries within a time window (approximately 1-10 s, configurable). Contains all requests within that window, their matching responses within that or the next timeframe, and responses within this timeframe without a matching request. A field sketch follows the list below.
Each timeframe is either in the main thread (current or previous frame) or in some of the output queues. The timeframes in the queues are refcounted to allow flexible timeframe dropping. The refcounts are under the global lock.
The separate queues allow e.g. dropping of timeframes on slow outputs (under heavy loads) while allowing fast outputs (e.g. counting stats) to see all the timeframes.
Has
- List of packets to write - possibly with rate-limiting per timeframe (linked list).
- Hash containing unmatched requests (by IPver, TCP/UDP, client/server port numbers, client/server IPs, DNS ID and QNAME) (can be freed on insertion to output queues).
- Refcount (number of queues containing it)
- Possibly (TODO?): a memory pool for all the packet data
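An illustrative field sketch of the timeframe under the above description; names are placeholders:

```c
/* Illustrative field sketch; names are placeholders. */
#include <stdint.h>

struct dns_packet;      /* see "Struct packet" below */

struct dns_timeframe {
    uint64_t time_start_us;            /* start of the time window */
    uint64_t time_end_us;              /* start + configured frame length */

    /* list of packets to write, in order of arrival */
    struct dns_packet *packets_head;
    struct dns_packet *packets_tail;

    /* fixed-size hash of unmatched requests: 1 << hash_order buckets,
     * each bucket a list chained via the packet's hash-next pointer;
     * can be freed once the frame is handed to the output queues */
    struct dns_packet **hash;
    unsigned hash_order;

    int refcount;                      /* number of output queues holding this frame;
                                          modified only under the collector-global lock */
};
```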
Query hash
The hash is a fixed-size table of linked packet lists with configurable order (size = 1 << order). Rationale: rehashing could cause a lot of latency in the main thread, and a reasonable limit on timeframe size (needed to limit memory usage anyway) also bounds the required hash size. (Even allowing 200 MB of memory per timeframe means ~1M packets; a comfortable 2M-element table (order 21) takes 16 MB.) A hash big enough for the upper limit of packets in the timeframe (hard limit or just estimated) takes about 8% of the memory of the packets themselves, so it can easily be afforded within the expected memory usage. Each bucket holds a linked list of packets (chained via the "next" pointer within the packet struct).
Currently, request/response pairs are matched on (IPver, transport, client IP, client port, server IP, server port, DNS ID, QNAME, QType, QClass); a bucket-selection sketch follows this list.
- QNAME, QType, QClass might not be present in all responses (e.g. NOT_IMPL)
- server IP/port is redundant for ICANN, but might be useful for client watching (or upstream from DNS cache, ... ?)
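A bucket-selection sketch under these assumptions; the FNV-1a hash is an illustrative choice, and which key fields are re-checked exactly on collision is up to the implementation:

```c
/* Sketch of bucket selection for the fixed-size query hash (table size = 1 << order).
 * The FNV-1a hash is illustrative, not necessarily what the collector uses. */
#include <stddef.h>
#include <stdint.h>

static uint64_t fnv1a_64(const void *data, size_t len, uint64_t h)
{
    const uint8_t *p = data;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;          /* FNV-1a 64-bit prime */
    }
    return h;
}

static uint32_t query_hash_bucket(uint8_t ip_version, uint8_t transport,
                                  const uint8_t *client_addr, uint16_t client_port,
                                  const uint8_t *server_addr, uint16_t server_port,
                                  size_t addr_len, uint16_t dns_id,
                                  const uint8_t *qname, size_t qname_len,
                                  unsigned order)
{
    uint64_t h = 0xcbf29ce484222325ULL; /* FNV offset basis */
    h = fnv1a_64(&ip_version, sizeof ip_version, h);
    h = fnv1a_64(&transport, sizeof transport, h);
    h = fnv1a_64(client_addr, addr_len, h);
    h = fnv1a_64(&client_port, sizeof client_port, h);
    h = fnv1a_64(server_addr, addr_len, h);
    h = fnv1a_64(&server_port, sizeof server_port, h);
    h = fnv1a_64(&dns_id, sizeof dns_id, h);
    h = fnv1a_64(qname, qname_len, h);
    return (uint32_t)(h & ((1u << order) - 1));  /* mask down to the table size */
}
```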
Limiting memory use
The number of requests (and unmatched responses) in a frame should be bounded by a configurable constant. This should be a soft limit (e.g. packets should be dropped more and more frequently when approaching the limit). When a request is accepted, its response should always be accepted.
Estimate: with a limit of 1 Mq per frame, roughly 200 B/query (in memory), 1 output and 5x 1 s frames in the queue, 1.6 GB of memory should suffice for all the packets.
Rationale: The packets in the timeframes take up most of collector memory. Since the memory use of a single packet is bounded by the packet capture bound plus a fixed overhead, bounding the packet number per timeframe is an easy and deterministic way to control memory usage (together with the number of timeframes).
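As one possible interpretation of the soft limit, a sketch where the drop probability grows over the last part of the frame's capacity; the 10% "soft zone" is an illustrative choice, not part of the design:

```c
/* Sketch of a soft per-frame packet limit; thresholds are illustrative. */
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

static bool frame_accept_request(uint32_t packet_count, uint32_t limit)
{
    if (packet_count >= limit)
        return false;                              /* hard cap reached */
    uint32_t band = limit / 10 ? limit / 10 : 1;   /* size of the soft zone */
    if (packet_count + band >= limit) {
        uint32_t slack = limit - packet_count;     /* remaining capacity */
        return (uint32_t)rand() % band < slack;    /* accept with probability slack/band */
    }
    return true;                                   /* well below the limit */
}
```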
Alternatives: a total packet count (for the entire collector) could better accommodate short bursts (spanning, say, 1-2 timeframes), but keeping these numbers in sync between the threads adds complexity, and the behaviour is less predictable. Another alternative is to consider the total memory usage of the program; it is unclear how technically viable and reliable that would be (what to measure? would such memory usage shrink on free()?), and it might not be very predictable either.
Question: What to do with the (not dropped) responses to interface-dropped requests?
Struct packet
Holds data about a single parsed query packet. Since libtrace handles packet data management and dissection, only the extracted network info and the raw DNS data are stored in the packet. The DNS parsing consists of simple header and QNAME label reading, without handling name compression. The remaining parts of the DNS message (the various RRs) are not parsed (until we figure out how they would be useful). The full request DNS data might be stored (as it may be useful later for replays), while only part of the response (status code, metadata) might be stored. A field sketch follows the list below.
Rationale: the data in RRs can be quite big, and it is hard to know in advance what all we might want from them. The data from the DNS header + QNAME + QType + QClass + DNSSEC flags seem to carry enough information for statistics. The DNSSEC info will require some RR parsing (use ldns?). Replies should in principle be recomputable from requests. If it turns out to be necessary to store all the information, a full PCAP dump (in a separate process) could be more appropriate.
Has
- Raw packet data: timestamp, real length, capture length
- Addresses, ports, transport info
- Raw DNS data (full for requests, truncated for responses), partially parsed DNS header (QNAME position and length)
- Request may have a matching response packet. In this case the response is owned by the request
- Next packet in hash bucket, next packet in timeframe
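An illustrative field sketch of the packet record matching the above; names and exact types are placeholders:

```c
/* Illustrative field sketch; names and exact types are placeholders. */
#include <stdint.h>

struct dns_packet {
    uint64_t ts_us;                    /* capture timestamp (microseconds) */
    uint32_t wire_len;                 /* real length on the wire */
    uint32_t capture_len;              /* captured (possibly truncated) length */

    uint8_t ip_version;                /* 4 or 6 */
    uint8_t transport;                 /* UDP or TCP */
    uint8_t addr_len;                  /* 4 or 16 bytes */
    uint8_t client_addr[16], server_addr[16];
    uint16_t client_port, server_port;

    uint8_t *dns_data;                 /* raw DNS message: full for requests,
                                          truncated for responses */
    uint16_t dns_len;
    uint16_t dns_id;                   /* from the parsed DNS header */
    uint16_t qname_offset, qname_len;  /* QNAME position and length within dns_data */

    struct dns_packet *response;       /* matching response, owned by this request */
    struct dns_packet *next_in_hash;   /* next packet in the hash bucket */
    struct dns_packet *next_in_frame;  /* next packet in the timeframe list */
};
```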
Packet network features
Handles both IPv4 and IPv6; currently only UDP transport (TCP is discussed below).
Packet defragmentation is not currently handled. It would be nontrivial to do correctly and to manage resources for, while IP-level fragmentation is rare for DNS packets. Fragmented packets can all be dumped for later analysis.
Alternative solution: capturing via a Linux RAW socket() gives us IP-defragmented packets.
TCP flow could and should be reconstructed, but it seems less of a priority. Previously, one-data-packet TCP streams (not counting SYN, ACK and FIN packets) were processed, longer streams dropped - this can be re-enabled relatively easily. Longer packets and long-open TCP connections seem to be uncommon, but might be important in the future.
Stats
Very basic statistics for the collector (time, dropped/read packets, dropped frames), the timeframes (dropped/read packets) and the outputs (dropped/read packets, dropped timeframes, written items and bytes before/after compression). It is not yet clear what exactly to measure. Any DNS data statistics (DSC-like) should be handled by an output plugin. Currently only partially implemented and to be redesigned.
Outputs
Each output type extends a basic output structure. This basic structure contains the current open file and filename (or socket, etc.), time of opening, rotation period, compression settings, basic statistics (bytes written, frames dropped, ...) and hooks for packet writing, dumping, file closing and opening.
Each output type (currently CSV, defunct ProtoBuf, in the future CBOR) extends this type with additional struct fields and sets the hooks appropriately on config (configuration handled by libucw).
The current output fields are:
flags(IPv4/6,TCP/UDP) client-addr client-port server-addr server-port id qname qtype qclass request-time-us request-flags request-ans-rrs request-auth-rrs request-add-rrs request-length response-time-us response-flags response-ans-rrs response-auth-rrs response-add-rrs response-length
Every output has a pathname template with strftime() replacement. An output can be compressed on the fly (which saves disk space and also write time). Fast compression (LZ4, ...) is preferred; currently LZ4 is implemented. A packet-block-based compression mode can be enabled.
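A sketch of the pathname expansion using plain strftime(); the template string and the choice of UTC are examples only:

```c
/* Sketch of pathname template expansion via strftime(); UTC is an assumption,
 * the real code may use local time depending on configuration. */
#include <stddef.h>
#include <time.h>

static void output_make_path(char *buf, size_t buflen, const char *tmpl, time_t when)
{
    struct tm tm;
    gmtime_r(&when, &tm);
    strftime(buf, buflen, tmpl, &tm);
}

/* usage example: output_make_path(path, sizeof path, "dns-%Y%m%d-%H%M%S.csv", frame_start); */
```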
Memory usage limits
See the timeframe discussion.
Disk usage limits
Optional. When approaching a per-output-file size limit, softly introduce query skipping. Not implemented.
CSV output
Optional header line, configurable separator, configurable field set. Actually not much larger than Protocol Buffers when compressed (e.g. with just the very fast "lz4 -4": 33 B/query for CSV, 29 B/query for ProtoBuf). The most commonly accepted format. No quoting is necessary with e.g. "|" as the delimiter and "reasonable" QNAMEs. Currently, unprintable characters and the separator in the QNAME are replaced.
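A sketch of the QNAME sanitisation mentioned above; the replacement character is an arbitrary example:

```c
/* Sketch of QNAME sanitisation for CSV output: unprintable characters and
 * the configured separator are replaced ('#' is an illustrative choice). */
#include <ctype.h>

static void csv_sanitize_qname(char *qname, char separator)
{
    for (char *p = qname; *p; p++)
        if (!isprint((unsigned char)*p) || *p == separator)
            *p = '#';
}
```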
Protocol Buffer output
Similar to CSV, configurable field set, one length-prefixed (16 bits) protobuf message per query.
The protobuf-c library seems to use reflection when serialising rather than fully generated code (as the C++ implementation does), so the speed is not great (comparable to CSV?). Also, the length prefixing is necessary and makes the output unreadable by standard Protobuf tools. The output code is currently not updated and defunct.
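A sketch of the 16-bit length-prefix framing; the byte order of the prefix and the stdio-based writing are assumptions, and packing the message itself (via protobuf-c's generated __get_packed_size()/__pack() functions) is not shown:

```c
/* Sketch of 16-bit length-prefix framing for the ProtoBuf output;
 * the big-endian prefix is an illustrative assumption. */
#include <stdint.h>
#include <stdio.h>

static int write_prefixed_message(FILE *f, const uint8_t *msg, uint16_t len)
{
    uint8_t prefix[2] = { (uint8_t)(len >> 8), (uint8_t)(len & 0xff) };
    if (fwrite(prefix, 1, sizeof prefix, f) != sizeof prefix)
        return -1;
    if (fwrite(msg, 1, len, f) != len)
        return -1;
    return 0;
}
```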
Inputs
The input is either a list of live interfaces, or a list of pcap files to be processed in the given order. When reading pcap files, the "current" time follows the recorded times. For online capture, multiple interfaces are supported. Promiscuous mode and BPF filtering are configurable.
Note: multiple reader threads would be harder to support, as access to the query hash would have to be guarded somehow, or the hashes kept per-thread. Since the main congestion is expected to be at the outputs, this may not be a problem. If required in the future, it can be added as a (very) advanced feature.
Libtrace is preferred over tcpdump's libpcap for its larger feature set, implemented header and layer skipping, and a larger set of inputs (including kernel ring buffers).
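A sketch of the capture setup and read loop with libtrace; the URI, snap length and the collector_process_packet() hook are illustrative, and error handling is abbreviated:

```c
/* Sketch of a libtrace capture setup and read loop; values are examples. */
#include <libtrace.h>

void collector_process_packet(libtrace_packet_t *pkt);  /* hypothetical collector hook */

int run_capture(const char *uri, const char *bpf)
{
    libtrace_t *trace = trace_create(uri);       /* e.g. "pcapfile:dump.pcap.gz" or "int:eth0" */
    if (trace_is_err(trace))
        return -1;

    int promisc = 1, snaplen = 512;              /* example capture settings */
    trace_config(trace, TRACE_OPTION_PROMISC, &promisc);
    trace_config(trace, TRACE_OPTION_SNAPLEN, &snaplen);

    libtrace_filter_t *filter = bpf ? trace_create_filter(bpf) : NULL;
    if (filter)
        trace_config(trace, TRACE_OPTION_FILTER, filter);

    if (trace_start(trace) == -1)
        return -1;

    libtrace_packet_t *pkt = trace_create_packet();
    while (trace_read_packet(trace, pkt) > 0)
        collector_process_packet(pkt);           /* parse headers, match queries, rotate frames */

    trace_destroy_packet(pkt);
    trace_destroy(trace);
    if (filter)
        trace_destroy_filter(filter);
    return 0;
}
```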
Configuration / options
Configuration is read by the libucw configuration system. Configuration should allow setting predictable limits on memory usage and potentially disk usage. CPU usage should be regulated by means of the OS (nice, cpulimit, cgroups).
Reloading config is supported only via program restart. Optionally, the program could wait until the outputs are rotated (or at least until timeframe rotation). Shortly after program start, unmatched responses should be ignored. The amount of missed packets should not be significant relative to the frequency of such changes.
Supporting online reconfiguration would significantly increase program complexity. What makes sense to be reconfigurable?
Logging and reports
Currently using libucw logging system and configured via the same config file. Includes optional log file rotation. A sub-logger for potentially frequent messages with rate-limiting is also configured by default.
Input and output statistics should be logged (e.g. on output file rotation). Statistical outputs might include some of these statistics. No other reporting mechanism is currently designed.
Questions
- Runtime control and reconfiguration - how much control is desired and useful? How to implement it? Currently: no runtime control.
- Which output modules to support? CSV, Protobuf, counting stats (DSC-like?), CBOR, ...