Storage format and schema
CBOR-based format proposal
TODO(tomas): make a sketch based on ideas from discussion with Sinodun
Formats
(Updated Tomáš's writeup from dns-stat-devels)
CBOR
An elegant schema-less format with good bindings and parsers for C and other languages, specified in an RFC. Messages are self-delimited and can be concatenated into a file or socket.
- Plus: Open, RFC, libs for many langs, nice streaming C API (easy writing in C), self-delimited messages, for us: can be compact when tables are used, fast
- Minus: no schema (more work to read in statically typed languages), less used in the ecosystems of Hadoop etc.
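To illustrate what "self-delimited" buys us, here is a minimal sketch of a reader that walks a buffer of concatenated CBOR items with no extra framing. It is not a full decoder: it only handles unsigned integers, text strings, arrays and maps with definite lengths, and the names `decode_item` and `iter_items` are made up for this example. A real implementation would use an existing CBOR library.

```python
def decode_item(buf, pos=0):
    """Decode one CBOR item starting at buf[pos]; return (value, next_pos).

    Sketch only: supports major types 0 (unsigned int), 3 (text string),
    4 (array) and 5 (map), all with definite lengths.
    """
    initial = buf[pos]
    major, info = initial >> 5, initial & 0x1F
    pos += 1
    if info < 24:
        arg = info                      # argument stored in the initial byte
    elif info <= 27:
        size = 1 << (info - 24)         # 1, 2, 4 or 8 following bytes
        arg = int.from_bytes(buf[pos:pos + size], "big")
        pos += size
    else:
        raise ValueError("indefinite lengths are not handled in this sketch")

    if major == 0:
        return arg, pos
    if major == 3:
        return buf[pos:pos + arg].decode("utf-8"), pos + arg
    if major == 4:
        items = []
        for _ in range(arg):
            item, pos = decode_item(buf, pos)
            items.append(item)
        return items, pos
    if major == 5:
        result = {}
        for _ in range(arg):
            key, pos = decode_item(buf, pos)
            value, pos = decode_item(buf, pos)
            result[key] = value
        return result, pos
    raise ValueError("major type %d is not handled in this sketch" % major)


def iter_items(buf):
    """Yield every top-level item from a buffer of concatenated CBOR items."""
    pos = 0
    while pos < len(buf):
        item, pos = decode_item(buf, pos)
        yield item


# Three items written back to back: [1, 2, 3], 1000 and {1: 2, 3: 4}.
stream = bytes.fromhex("83010203") + bytes.fromhex("1903e8") + bytes.fromhex("a201020304")
items = list(iter_items(stream))
print(items)
```

Each item carries its own length information, so the reader always knows where the next item starts.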
Object representation
One disadvantage of CBOR (as well as other schema-less formats, such as JSON, BSON and MessagePack) is that the natural representation of a record is an "attribute name":value dictionary, which takes a lot of space on the wire. This can be partially fixed by some LZW-like compression, but the cost (both power and complexity) of dealing with string keys on writing and parsing stays.
Another object representation is to use index:value dictionaries instead (a bit like Protobuf does), but that requires a schema definition, a library for such a format has to be written separately in every language, and no generic CBOR tools would be aware of the schema. Also, the dictionary representation is not as compact as Protobuf field numbering. Probor proposes to use arrays instead of index:value dictionaries, like a CSV with a permanently fixed column set. This can also be inefficient in storage (imagine many optional columns, each of which still takes a slot in every record).
John mentioned Probor (https://github.com/tailhook/probor) and it seems to be a great idea (Protobuf-like over CBOR), but it is currently just a proof of concept. (Should we perhaps help develop it, then? :-)
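To get a feel for the size difference between the three representations, here is a rough sketch with a tiny hand-written CBOR encoder (unsigned integers, text strings, arrays and maps only). The record fields (`qname`, `qtype`, `rcode`, `client_port`) are invented for illustration and are not a proposed schema.

```python
import struct


def cbor_head(major, n):
    """Encode the CBOR initial byte plus its argument for a major type."""
    if n < 24:
        return bytes([(major << 5) | n])
    for info, fmt in ((24, ">B"), (25, ">H"), (26, ">I"), (27, ">Q")):
        if n < (1 << (8 * struct.calcsize(fmt))):
            return bytes([(major << 5) | info]) + struct.pack(fmt, n)
    raise ValueError("integer too large for CBOR")


def cbor_encode(obj):
    """Encode a small subset of Python values as CBOR (sketch only)."""
    if isinstance(obj, bool):
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(obj, int) and obj >= 0:
        return cbor_head(0, obj)
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        return cbor_head(3, len(raw)) + raw
    if isinstance(obj, list):
        return cbor_head(4, len(obj)) + b"".join(cbor_encode(x) for x in obj)
    if isinstance(obj, dict):
        body = b"".join(cbor_encode(k) + cbor_encode(v) for k, v in obj.items())
        return cbor_head(5, len(obj)) + body
    raise TypeError("unsupported type: %r" % type(obj))


# The same (hypothetical) DNS record in the three representations.
by_name = {"qname": "example.com", "qtype": 1, "rcode": 0, "client_port": 53124}
by_index = {0: "example.com", 1: 1, 2: 0, 3: 53124}
as_array = ["example.com", 1, 0, 53124]

sizes = {
    "name:value": len(cbor_encode(by_name)),
    "index:value": len(cbor_encode(by_index)),
    "array": len(cbor_encode(as_array)),
}
print(sizes)
```

For this one record the encodings come to 48, 22 and 18 bytes respectively; most of the name:value overhead is the repeated key strings.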
For our purpose (DNS requests), we could use the CSV-like representation (each record a simple array with given fields) for every time-frame (I imagine every 5-300 sec) with a nicely-presented header record at the start of the frame/file containing the names of present "columns" (as well as other metadata, like data source machine and basic stats ...). Still we would have to implement this format ourselves in C, Java and likely Python, and it is not too friendly to outsiders.
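A minimal sketch of how such a frame could look as plain data, before any CBOR encoding: one header dictionary followed by one array per record. The header keys (`source`, `start`, `columns`, `count`) and the helper names are assumptions made for this example only.

```python
def build_frame(source, start, columns, records):
    """Build a frame: a header dict followed by one plain array per record.

    Fields missing from a record become None (CBOR null when encoded), which
    is where optional columns cost space in this layout.
    """
    header = {
        "source": source,          # data source machine
        "start": start,            # start of the time-frame (UNIX time)
        "columns": list(columns),  # names of the columns present in this frame
        "count": len(records),     # basic stats could go here as well
    }
    rows = [[rec.get(col) for col in columns] for rec in records]
    return [header] + rows


def read_frame(frame):
    """Yield each record of a frame as a name:value dict, using the header."""
    header, rows = frame[0], frame[1:]
    columns = header["columns"]
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("row length does not match the header columns")
        yield dict(zip(columns, row))


columns = ["qname", "qtype", "rcode"]
records = [
    {"qname": "example.com", "qtype": 1, "rcode": 0},
    {"qname": "example.org", "qtype": 28},  # rcode not recorded
]
frame = build_frame("resolver-1", 1700000000, columns, records)
decoded = list(read_frame(frame))
print(frame)
```

The column names are written once per frame, so a reader that only knows generic CBOR can still make sense of the rows by looking at the header.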
Reading and type-checking
In statically typed languages it is not trivial to write or read a schema-less format into objects/structs -- you have to semi-manually branch on the field name (or index), check the type and assign it to the attribute. Protobuf and Cap'n Proto define and populate these objects/structs for you.
Protocol Buffers
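The semi-manual work looks roughly like the sketch below, written in Python for brevity (in C or Java the same branching is needed, only more verbose). The `DnsQuery` class and its fields are hypothetical; the input `raw` stands for a dictionary as produced by a generic schema-less decoder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DnsQuery:
    qname: str
    qtype: int
    rcode: Optional[int] = None


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def query_from_map(raw):
    """Populate a DnsQuery from a decoded name:value map, checking types."""
    qname = qtype = rcode = None
    for key, value in raw.items():
        if key == "qname":
            if not isinstance(value, str):
                raise TypeError("qname must be a text string")
            qname = value
        elif key == "qtype":
            if not _is_int(value):
                raise TypeError("qtype must be an integer")
            qtype = value
        elif key == "rcode":
            if not _is_int(value):
                raise TypeError("rcode must be an integer")
            rcode = value
        # Unknown keys are ignored here; a stricter reader could reject them.
    if qname is None or qtype is None:
        raise ValueError("missing required field")
    return DnsQuery(qname, qtype, rcode)


query = query_from_map({"qname": "example.com", "qtype": 1, "extra": "ignored"})
print(query)
```

Every new field means another branch and another check, which is exactly the code a schema compiler would generate for us.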
- Plus: Open, libs for many langs, fast, very compact, schema-based, nice C struct-based API, well-known
- Minus: No streaming in C API, messages not self-delimited, C libs use reflection to write (slower than in C++, but speed comparable to CBOR/...), proto ver 2/3 syntax schism
Nice and established format with a schema. Good APIs for C and others, APIs for both compiled (fast) and dynamic (reflection-based, slower) reading. Very compact wire-format, quite fast. There are two versions, proto2 and proto3 (a schism a bit like Python 2 vs Python 3 :-), but they are binary compatible.
Most implementations can read only entire messages/objects, so objects cannot be too large (<1MB). Therefore, we cannot have an entire time-frame as a single object (one header, multiple records), as would be most convenient. To serialize more messages into a file/socket, some additional framing must be used. Simple length-prefixed messages are suggested at https://developers.google.com/protocol-buffers/docs/techniques#streaming
An advantage is that given the schema, the format is easily readable in many languages and with several tools (there even is a Wireshark plugin).
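A sketch of what such length-prefixed framing could look like. The 4-byte big-endian prefix is an assumption for this example (some Protobuf libraries use a varint prefix in their delimited helpers instead), and the payloads are opaque bytes standing in for serialized messages.

```python
import io
import struct

PREFIX = struct.Struct(">I")  # assumed framing: 4-byte big-endian length


def write_message(stream, payload):
    """Write one message as <length prefix><payload>."""
    stream.write(PREFIX.pack(len(payload)))
    stream.write(payload)


def read_messages(stream):
    """Yield payloads until a clean end of stream.

    Assumes stream.read(n) returns n bytes unless the stream ends, as
    io.BytesIO and buffered files do; a raw socket would need a read loop.
    """
    while True:
        prefix = stream.read(PREFIX.size)
        if not prefix:
            return  # clean end of stream between messages
        if len(prefix) < PREFIX.size:
            raise EOFError("truncated length prefix")
        (length,) = PREFIX.unpack(prefix)
        payload = stream.read(length)
        if len(payload) < length:
            raise EOFError("truncated message body")
        yield payload


# Placeholder payloads; real ones would be serialized Protobuf messages.
buf = io.BytesIO()
for payload in (b"header", b"record-1", b"", b"record-2"):
    write_message(buf, payload)
buf.seek(0)
messages = list(read_messages(buf))
print(messages)
```

With this framing a time-frame becomes one header message followed by many small record messages, each well under the size limit.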
Cap'n Proto
- Plus: speed
- Minus: not very compact, no C reading API, not very well-known, complicated format
Speed-oriented schema format, wire-format is a bit wasteful (64-bit words, and so it has its own "packed" version to remove the extra zeros). Seems to be designed for C++ (very fast there, not sure how it would perform in JVM). A bit complicated struct layout (as noted in the Probor discussion).
Somewhat new; C and Java only have a writing API (!).
Other formats
Other schema-less formats are rather similar to CBOR in the main aspects (JSON, BSON, MessagePack, ...). Other schema-based formats include Thrift and Avro -- they do not seem to be more interesting than Protocol Buffers but I do not know much about them.
I look forward to any comments, experiences and ideas!
Databases
- A NoSQL comparison (Cassandra, Couchbase, HBase, MongoDB; but only 100 byte packets): PDF