rrset - refactoring, memory optimizations

I suggest dividing into two issues - moving RRSIGs may be done independently and will require refactoring of a lot of code.

As for the refcounting, obvious solution is to separate CLASS, TYPE, and RDATA into separate structure and refcount it. TTL cannot be included, as it may differ between the RRSets. This, together with ordering of RDATA (for easier comparing) should be sufficient and relatively straightforward to implement.

Any other thoughts?

Can you form a minimal implementation proposal, so we have something to measure against and edit the issue description appropriately? @jkadlec

Possibly duplicated RR types: NS, CNAME, MX
First step has to be storing of DNAMEs directly in RDATA (should not take long to write)
I would definitely not change the knot_rrset_t structure in any way (tons of code would have to be rewritten, wire parsing and utilities included). We could move the owner pointer to the end of the structure to save some space when the owner is not needed.
We already have a DNAME duplicate checking function (uses trie), we'll use a similar approach to handle RRSet duplications. Our key: all of the RRs RDATA concatenated + TTL + maybe class. In theory, there's no limit how big this key can be, so let's say we introduce a threshold of 1k - RRSets longer than this will be duplicated. This lookup structure has to be freed after zone is loaded / transfer is done. To ensure that we return the memory to OS, use custom allocator.
knot_rrset_to_wire has to be changed - add a parameter with owner, and all the calls have to changed - this could be problematic, since in many places, there's no longer reference to knot_node_t structures.
Code loading zone files / doing AXFR has to be merged (arbitrary semantic checks as well) and changed to call duplicate checking function after the whole RRSet is loaded. This basically means, that if the RRSet is scattered into multiple RRs that are not right next to each other in zone file / AXFR, such RRSets will be duplicated.
We have to be careful when changing IXFR code - if we modify RRSet that is shared by multiple nodes, we have to create a copy (I understand that IXFR code already does it, @lslovak confirm please) a then the copy has to queried for duplicates. A typical example: one of the hosting companies has changed its NS record. This will effect all the records that are 'owned' by this company. This means that the refcount of 'old' RRSet will be gradually lowered to zero (if all the occurrences were changed), whereas the 'new' RRSet, one including the changed NS record, will have a refcount equal to the old one before the transfer. Unfortunately, we need the lookup structure to be built before we start applying the changesets, so we have to build our trie with RRSet duplicates before each IXFR (zone loading and AXFR will build it gradually, since there is no zone to use anyway). Still better love story than current implementation, which builds DNAME trie two times: one handles duplicates within the IXFR, and one is built when adjusting the zone (just a side note: in case we don't squeeze this change to 1.4, it has to be changed). If the transfer rate is really high, it might be more feasible to keep the structure in memory, but we'll see how it behaves.

Those are my thoughts for now, please comment.

How to do it

I agree, the first step is to store dnames directly in the data. Second step - zone file loading / AXFR code should be merged, good idea. Third step (this can take time) requires to clear the APIs a bit.

Problems

It is still not clear then what structure should be reference counted if not knot_rrset_t ? I understand that it's the most complicated at the moment, but I think it's good for several reasons:

We don't have clear distinction what structure represents what part of the RR
- knot_dname_t for owner
- knot_rrset_t for data
- knot_node_t as a glue for owner - data
- Maybe it's meant differently, but it shouldn't be "everyone is linked to everything"

Then it wouldn't be a problem to compress RRSets. But also, it wouldn't be possible to compress rrsets with different TTL/CLASS. But would it be a problem in real world scenario? I guess not. The amount of changed code shouldn't be so horrible, the only problem is to update APIs, so that it doesn't try to fetch owner from rrset, but rather have both rrset+owner or node (which is glue) supplied. Or is there something else?

TL;DR - I think first and second step should be done first, as it's refactoring. The third complicated step should be done after that, as it's more like a memory optimization and not something essential.

I don't like tie idea of not having owner (or pointer to it) in the RRSet structure - mostly because of the huge changes of dependent code it would require. However, it's probably the best in terms of memory consumption. The other solution (pointer to owner in the RRSet structure and thus separate RDATA structure to be refcounted) requires 2 more pointers per RRSet. The solution with owner moved to the end and not putting it there for zone RRSets would somehow reduce the amount of changes in code, but most of the answering and transfer code would have to be changed anyway. Can someone think of another solution?
If the owner is removed from RRSet, then the whole structure may be refcounted. As Marek mentioned, in real world scenario it shouldn't happen very often that similar RRSets have differents TTLs or CLASS.

I definitely agree with dividing into the three steps. Maybe even merging and testing them separately. It could reduce possible problems. Also, after the first step, the dname_refator branch with this change could be merged to master, or am I mistaken? It would save us one big ugly merge at the end of all these changes.

Calculating average of testing zones:

average NS length is 16B
each is repeated 88 times

If we removed refcounted triple {rdata, indices, count} and disconnected it from the knot_rrset_t, this would add 8B ptr to each knot_rrset_t a new allocation of the triple, but reusing it would save approx 51MB on our TLD zone minus the penalty of the added ptr to each non-NS rrset (which is estimated additional COST of 10-20M) and another allocation for each non-NS and unique NS rrset. Also refcount cost is not counted and expecting RRs to have only pointer to node owner, not a duplicate. This could be possible lowered by using flags and allocating variable length RRSet, BUT this would make shallow/deep copy and freeing even more tricky than it is now.

If we refcounted whole rrset and removed owner and rrsigs, this would save approx 115MB and some minor savings (not counted, estimating zero) for each non-NS node as the two pointers would be removed from the rrset structure to node.

Both approaches would require quite a lot of changes, the latter probably more. In addition - since RRSet usually contains just a few RRs, we could ditch the indices and use it like the way labels are stored in dname. RDATA would look like: [16bit length] data [16bit length] data. This would require N pointer skips for access to Nth RDATA, sequential access would be free.

For comparison - we need to share dname between node and it's rrsets, not having a duplicates. This would save for the same zone about 40+24M, as each node has about 1-2 RRSets (thus 1.5 superfluous names) and average node owner length is 15B (37B for NSEC3, 1 superfluous per node). So this seems like a feasible first step, as it won't break much of the API.

Savings are counted for the 64bit machine, separate allocations are rounded to the nearest upper neighbour of 8B to accomodate allocation overhead. To give it a sense of scale, whole zone is 767MB in memory atm.

The node-ratio was probably a bit off and some allocator costs are in there. When the dname is shared between node and rrsets, the savings are 52M for mentioned zone. Estimate was 64M.

I'm not sure I understand: what do you mean by 'each is repeated 88 times'? Some NS RRSets are used 10k+ times.

Meaning: there are C*88 NS RRSets from which C are unique.

List of things that need to be addressed first to reduce the amount of changes:

Keep indices as sizes or use inline [length]data for rdata items.
- Keep: Sorting could be done with reordering of indices, not the actual data (if we add the first element to indices :( ). And random access (if it's desirable at all)
- Replace: Less space, simple length calculation, similar approach like with dnames.
- Anyway: Lot code of code (serialization, deserialization...) iterates the rdata items to calculate each length over again. Most of it could be reduced to memcpy by either following indices or using item lengths in the array. Also sorting does both, not much sense in that.
Remove the duplicated code to make transition easier.
Signatures must be moved to node, this may be problematic with current API.

Very good point with reordering just indices, I did not think of that, but speed would not improve much, ihmo not needed now. But why do we even care about this side (RDATA, that is) of the code, ideally, it should not change at all, provided we keep the RRSet structure mostly intact.

Keep: simpler length calculation than Replace.
Against Replace: more complicated work with wire format of RDATA (e.g. when signing).

Regarding moving signatures to node - be extra careful, they cannot be merged into one RRSet. But since RRSets in node are now kept in a simple array, it should be OK to just insert all RRSIG RRSets one after another, or even better - each after the RRSet it signs. (Note that if I write "RRSIG RRSet", I mean set of RRSIG RRs covering one RRSet.)

It is indeed not a requirement for refcounting, but I care not for the speed improvements, but for the code complexity reduction. For me, it's easier to refactor something smaller rather than large and now it's not clear enough where I work with the rrset itself and/or rdata.

Replace: hard to say, managing indices also require extra code and space. If only the indices would be reordered, memcmp itself wouldn't be sufficient for comparison (because the data array would be still unsorted). The cost of replacing it would be extra loop when we want to do something with the rdata as a whole, like - compare, sign and I can't think of what else.

With indices, it could be compared like: memcmp(r1, r2, length). With inline item sizes: while(item_size(r1) > 0) { memcmp(r1, r2, item_size(r1)); r1 = next.., r2 = next.. } or so. Truth is, I am used to this from dnames so for me it is consistent, but I also like benefits of indices for sorting.

Well, OK, i still counted with ordered RDATA instead of indices. I thought the goal is to simplify the usage as much as possible, even at the cost of speed (ordering the RDATA).

If it's ordering by indices vs. inline item sizes, then the latter is probably better: usage is equally complicated and it saves some space.

I have updated the issue with suggestions for changes and steps, lots of stuff is up to implementor or will clear up when you start the works. Take them with a grain of salt and provide your reasoning in the issues created for each part.

Some comments:

All the iterations are very similar, only specialty is compressing/deflating names - Agreed, but how to fix this? Some might be removed or merged, or we could add an apply function with three function pointers - one handling fixed size items, one for compressed dnames, one for remainder and I think we still need a special case for NAPTR, sadly (badly engineered RFC, if you ask me).
Also split the long functions - agreed
Compression is wonky with 2-3 different structures - Well, I took that code as-was from response.c, I know there were some changes to compression API, I'll look into that.
Writing to journal (aka binary) and to wire should be the same .... - yep
Reordering should be done depending on how the RDATA is stored, either indices only or RDATA - I'd rather not change this, sort both RDATA and indices. Maybe we could use lengths instead of offsets (but still separated, else we lose all the benefits connected to parsing and so on), we'd save some space this way, but access to random RRs would be more difficult and slower.
Also replace xmalloc like functions when you see them ... okay, makes sense I guess
This is more a question: replace direct accessors with to extra value with direct access ... - I mostly agree, but we'd need a rule saying that there either is an API and then you have to use it, or there isn't and then you do direct access. Otherwise we might end up in situation when you set(get) a value, but miss some needed operation.
Very important - clear questions and code review remarks from comments. - I, for one, don't mind questions in comments, if they really make sense, but okay.
Even static functions which are not oneliners ... Well, when a function is called knot_rrset_add_rr_sort_n it's obvious it's not for fetching bananas, but fine :trollface:
Split somehow functions that manipulate with rrset and functions that interpret/process the rdata. - Agreed, maybe one file for one type would be best.
No 1/0 parameters like ddns_check in knot_rrset_remove_rr_using_rrset - Agreed, but maybe for one on/off switch a bool argument is sufficient, when you need more, use flags?
Ordering of the functions should be the same as in header and should be sensible -fine

Good points overall, thanks

Processing - One callback with parameters {void* context, unsigned item_type, uint8_t* src, unsigned length} would be fine. The context could be a specific type carrying things like pointer to wire, wire pos, compression table and so on. Basically all the callback needs to do is either writing the rdata item somewhere (packet/journal->rrset or vice versa) or checking it. Any RDATA item (fixed, remainder, dname, whatever) should be passed to this callback along with the type. Or instead of callback, it could accept a structure with pointers to functions and call function for compressed names/rest accordingly. Just give it a thought, I'm sure you can work something out when you write it.

Reordering - hmm, I think parser gives you each RDATA item separately no? Then you wouldn't lose any benefit by storing lengths inline. I thought about it and the compare could still be done by one single memcmp. Presuming the RDATA items are sorted and each is prepended by lengths, then the records are equal if the lengths match and the data match. It won't be canonical comparison like with dnames, but I don't think that is required. Also how many items are in RDATA on average? Tens, hundreds? Then it would be slower to for example repetitively access the 101th item. And how often do you need random access? My experience from the dnames and tries is that jumping through max 127 labels is very very fast and hardly ever a bottleneck. I obviously think the benefits of inline sizes are bigger IF you don't use the indices for sorting, although I will accept any other solution if it would be more readable/considerably faster/less code. So before you decide, please think about the numbers first or even try and measure both approaches.

Direct access - fine, I propose we write it to the coding style. If the structure is public, you can access it's members directly. If the implementor doesn't want you to do that, hide the implementation-specific stuff using opaque pointers, prefix the member with underscore or write a comment. Good rule of thumb is looking at you usage pattern, I tend to choose the easiest way. If rrset->data is available, I'll choose it every time instead of boring myself to death with knot_rrset_get_rdata(rrset) (this is hypothetical situation). :hurtrealbad:

Questions in comments - fine if they are helpful questions, not fine if they are questions like [code review] What does it do? or I don't understand this.

Function fetching bananas - I know, and I am not asking you to write a paragraph and seal it with ASCII art diagrams. I ask for a single line. You are biased because you wrote it, but I see that it adds something, probably a RR to sorted set or it adds a RR and sorts the RRSet? And why is there n? I get the meaning eventually, but one descriptive line would save me from all this.

1/0 parameters - enum/flag is always better unless it's trivial. Honestly, most of the time something like this is even needed (exceptions) I immediately think that there is something fishy with the overall design and the exception is a bailout. It may not be true, just be careful with it.

Reordering - Now it does, but it should be much faster when just one block is given to function creating the zone. Comparison with inline lenghts would not be an issue, probably. Honestly, random access is not an issue either, it's needed scarcely. And now that I think about it, storing legths inline actually makes sense, but only for whole RRs not the actual items within one RR. This way, it would not break anything, since you never actually work with whole RRSet when, say, sending a message, you only work with RRs, and usually you iterate through all of them. It might be worth it after all, but iff we do not break the actual wireformat, let's see what @lslovak has got to say about it, but it could save us some memory (specifically, one pointer to the array, and uint16_t for each RR, since we'd use lengths and not offsets) and we could get rid of code handling indices, which is quite complicated sometimes. But in order for this to work fine, we'd need to change workflow with RRs a bit, especially all the RDATA get functions would not make sense anymore. Anyway, I think that the solution with indices is quite elegant, but if we end up with less code and less memory used, who I am to oppose this.

Direct access - I'd use underscores, but I think those are banned in our original coding style. But I agree that using getters is quite dull sometimes, especially when you chain them.

Banana function - I agree that some function names are not quite descriptive.

I agree with the rest.

mentioned in issue #157 (closed)

mentioned in issue #83 (closed)

Status changed to closed

mentioned in issue #445 (closed)

mentioned in merge request !507 (merged)

mentioned in commit 3a213378

removed milestone

mentioned in commit 93f3f0b4

mentioned in commit 77bd12d4

mentioned in commit 0fc6bbd1

mentioned in commit d644e00a

mentioned in commit 40f6c0f5

mentioned in commit af3ec1d9

mentioned in commit f85afb81

mentioned in commit 69b17ed1

mentioned in commit 58476f67

mentioned in commit e60e5f5d

mentioned in commit 3f90e444

mentioned in commit 9a76f4b6

mentioned in commit 80ddb29c

mentioned in commit 612479e9

mentioned in commit 2061615b

mentioned in commit c0d78899

mentioned in commit 7f8e1c9b

mentioned in commit 08c38698

mentioned in commit c0ab9d99

mentioned in commit c8748c64

mentioned in commit d7a20c8c

mentioned in commit a7c9c0c2

mentioned in commit b450d608

mentioned in commit 9c849319

mentioned in commit b9e1e9bd

mentioned in issue #871 (closed)

mentioned in commit f2f503a7

mentioned in commit 2a1db055

mentioned in commit 9064c42c

mentioned in commit d49ba189

mentioned in commit 2b6c77e6

mentioned in commit aa090ae0

mentioned in commit 3ae3fbdf

mentioned in commit 0c731838

mentioned in commit 2a00c0ee

mentioned in commit 792934b5

mentioned in commit 23f450ff

mentioned in issue #940 (closed)

mentioned in commit 2d97d0f0

mentioned in commit 078a6f8c

rrset - refactoring, memory optimizations

What

Summary of ideas from the discussion

Steps

Designs

Child items ...

Activity

How to do it

Problems

Admin message

rrset - refactoring, memory optimizations

What

Summary of ideas from the discussion

Steps

Activity

How to do it

Problems