rrset - refactoring, memory optimizations
What
There are two problems with knot_rrset_t
- the code has repetitive nature and is not very clear and rrsets take up space.
Summary of ideas from the discussion
- This is not a priority for the upcoming release, but the work should slowly start.
- Ordering is also a matter of discussion and is up to implementor with a reasonable explanation.
- Possibly duplicated RR types: NS, CNAME, MX make sense for compression.
- Pros/Cons of indices or inline lengths are in discussion.
- Make branch for each partial change to make reviews easier.
Steps
- All the iterations are very similar, only specialty is compressing/deflating names
- Also split the long functions
- Compression is wonky with 2-3 different structures, only
{wire, pos, table}
anddname
is required for compression. - Writing to journal (aka binary) and to wire should be the same, writing to wire is hardly a drag and would provide less code and less potential bugs.
- Reordering should be done depending on how the RDATA is stored, either indices only or RDATA.
- Also replace
xmalloc
like functions when you see them, therrset
should be a part of the core library. - This is more a question: replace direct accessors with to extra value with direct access, like
knot_rrset_set_ttl(rrset, 1)
has no extra value overrrset->ttl = 1
and is only longer to write. This is not a C++ and the structures are not opaque. Great way to fix this is http://coccinelle.lip6.fr/sp.php When the function has any extra benefit, than by all means provide a function. - Very important - clear questions and code review remarks from comments.
- Even static functions which are not oneliners deserve a short comment of what they do. e.g.
knot_rrset_add_rr_sort_n
- Split somehow functions that manipulate with
rrset
and functions that interpret/process therdata
. -
Very important - no 1/0 parameters like
ddns_check
inknot_rrset_remove_rr_using_rrset
. I have 99 problems with it and here are two good - caller has no immediate idea of what the value does, handling special cases like that almost always means wrong design. - Very important - ordering of the functions should be the same as in header and should be sensible, functions that are related somehow should be grouped together.
- Change descriptors so that uncompressed DNAMEs no longer exist.
No child items are currently assigned. Use child items to break down this issue into smaller parts.
Link issues together to show that they're related. Learn more.
Activity
- Author Contributor
I suggest dividing into two issues - moving RRSIGs may be done independently and will require refactoring of a lot of code.
As for the refcounting, obvious solution is to separate CLASS, TYPE, and RDATA into separate structure and refcount it. TTL cannot be included, as it may differ between the RRSets. This, together with ordering of RDATA (for easier comparing) should be sufficient and relatively straightforward to implement.
Any other thoughts?
- Author Contributor
Can you form a minimal implementation proposal, so we have something to measure against and edit the issue description appropriately? @jkadlec
- Author Contributor
- Possibly duplicated RR types: NS, CNAME, MX
- First step has to be storing of DNAMEs directly in RDATA (should not take long to write)
- I would definitely not change the
knot_rrset_t
structure in any way (tons of code would have to be rewritten, wire parsing and utilities included). We could move theowner
pointer to the end of the structure to save some space when the owner is not needed. - We already have a DNAME duplicate checking function (uses trie), we'll use a similar approach to handle RRSet duplications. Our key: all of the RRs RDATA concatenated + TTL + maybe class. In theory, there's no limit how big this key can be, so let's say we introduce a threshold of 1k - RRSets longer than this will be duplicated. This lookup structure has to be freed after zone is loaded / transfer is done. To ensure that we return the memory to OS, use custom allocator.
-
knot_rrset_to_wire
has to be changed - add a parameter with owner, and all the calls have to changed - this could be problematic, since in many places, there's no longer reference toknot_node_t
structures. - Code loading zone files / doing AXFR has to be merged (arbitrary semantic checks as well) and changed to call duplicate checking function after the whole RRSet is loaded. This basically means, that if the RRSet is scattered into multiple RRs that are not right next to each other in zone file / AXFR, such RRSets will be duplicated.
- We have to be careful when changing IXFR code - if we modify RRSet that is shared by multiple nodes, we have to create a copy (I understand that IXFR code already does it, @lslovak confirm please) a then the copy has to queried for duplicates. A typical example: one of the hosting companies has changed its NS record. This will effect all the records that are 'owned' by this company. This means that the refcount of 'old' RRSet will be gradually lowered to zero (if all the occurrences were changed), whereas the 'new' RRSet, one including the changed NS record, will have a refcount equal to the old one before the transfer. Unfortunately, we need the lookup structure to be built before we start applying the changesets, so we have to build our trie with RRSet duplicates before each IXFR (zone loading and AXFR will build it gradually, since there is no zone to use anyway). Still better love story than current implementation, which builds DNAME trie two times: one handles duplicates within the IXFR, and one is built when adjusting the zone (just a side note: in case we don't squeeze this change to 1.4, it has to be changed). If the transfer rate is really high, it might be more feasible to keep the structure in memory, but we'll see how it behaves.
Those are my thoughts for now, please comment.
- Author Contributor
How to do it
I agree, the first step is to store dnames directly in the data. Second step - zone file loading / AXFR code should be merged, good idea.
Third step (this can take time) requires to clear the APIs a bit.Problems
It is still not clear then what structure should be reference counted if not
knot_rrset_t
? I understand that it's the most complicated at the moment, but I think it's good for several reasons:- We don't have clear distinction what structure represents what part of the RR
-
knot_dname_t
for owner -
knot_rrset_t
for data -
knot_node_t
as a glue for owner - data - Maybe it's meant differently, but it shouldn't be "everyone is linked to everything"
-
Then it wouldn't be a problem to compress RRSets. But also, it wouldn't be possible to compress rrsets with different TTL/CLASS. But would it be a problem in real world scenario? I guess not. The amount of changed code shouldn't be so horrible, the only problem is to update APIs, so that it doesn't try to fetch owner from rrset, but rather have both rrset+owner or node (which is glue) supplied. Or is there something else?
TL;DR - I think first and second step should be done first, as it's refactoring. The third complicated step should be done after that, as it's more like a memory optimization and not something essential.
- We don't have clear distinction what structure represents what part of the RR
- Author Contributor
- I don't like tie idea of not having owner (or pointer to it) in the RRSet structure - mostly because of the huge changes of dependent code it would require. However, it's probably the best in terms of memory consumption. The other solution (pointer to owner in the RRSet structure and thus separate RDATA structure to be refcounted) requires 2 more pointers per RRSet. The solution with owner moved to the end and not putting it there for zone RRSets would somehow reduce the amount of changes in code, but most of the answering and transfer code would have to be changed anyway. Can someone think of another solution?
- If the owner is removed from RRSet, then the whole structure may be refcounted. As Marek mentioned, in real world scenario it shouldn't happen very often that similar RRSets have differents TTLs or CLASS.
I definitely agree with dividing into the three steps. Maybe even merging and testing them separately. It could reduce possible problems. Also, after the first step, the dname_refator branch with this change could be merged to master, or am I mistaken? It would save us one big ugly merge at the end of all these changes.
- Author Contributor
Calculating average of testing zones:
- average NS length is 16B
- each is repeated 88 times
If we removed refcounted triple
{rdata, indices, count}
and disconnected it from theknot_rrset_t
, this would add 8B ptr to eachknot_rrset_t
a new allocation of the triple, but reusing it would save approx 51MB on our TLD zone minus the penalty of the added ptr to each non-NS rrset (which is estimated additional COST of 10-20M) and another allocation for each non-NS and unique NS rrset. Also refcount cost is not counted and expecting RRs to have only pointer to node owner, not a duplicate. This could be possible lowered by using flags and allocating variable length RRSet, BUT this would make shallow/deep copy and freeing even more tricky than it is now.If we refcounted whole rrset and removed owner and rrsigs, this would save approx 115MB and some minor savings (not counted, estimating zero) for each non-NS node as the two pointers would be removed from the rrset structure to node.
Both approaches would require quite a lot of changes, the latter probably more. In addition - since RRSet usually contains just a few RRs, we could ditch the indices and use it like the way labels are stored in dname. RDATA would look like:
[16bit length] data [16bit length] data
. This would require N pointer skips for access to Nth RDATA, sequential access would be free.For comparison - we need to share dname between node and it's rrsets, not having a duplicates. This would save for the same zone about 40+24M, as each node has about 1-2 RRSets (thus 1.5 superfluous names) and average node owner length is 15B (37B for NSEC3, 1 superfluous per node). So this seems like a feasible first step, as it won't break much of the API.
Savings are counted for the 64bit machine, separate allocations are rounded to the nearest upper neighbour of 8B to accomodate allocation overhead. To give it a sense of scale, whole zone is 767MB in memory atm.
- Author Contributor
The node-ratio was probably a bit off and some allocator costs are in there. When the dname is shared between node and rrsets, the savings are 52M for mentioned zone. Estimate was 64M.
- Author Contributor
I'm not sure I understand: what do you mean by 'each is repeated 88 times'? Some NS RRSets are used 10k+ times.
- Author Contributor
Meaning: there are C*88 NS RRSets from which C are unique.
- Author Contributor
List of things that need to be addressed first to reduce the amount of changes:
- Keep indices as sizes or use inline
[length]data
for rdata items.- Keep: Sorting could be done with reordering of indices, not the actual data (if we add the first element to indices :( ). And random access (if it's desirable at all)
- Replace: Less space, simple length calculation, similar approach like with dnames.
- Anyway: Lot code of code (serialization, deserialization...) iterates the rdata items to calculate each length over again. Most of it could be reduced to memcpy by either following indices or using item lengths in the array. Also sorting does both, not much sense in that.
- Remove the duplicated code to make transition easier.
- Signatures must be moved to node, this may be problematic with current API.
- Keep indices as sizes or use inline
- Author Contributor
Very good point with reordering just indices, I did not think of that, but speed would not improve much, ihmo not needed now. But why do we even care about this side (RDATA, that is) of the code, ideally, it should not change at all, provided we keep the RRSet structure mostly intact.
- Author Contributor
- Keep: simpler length calculation than Replace.
- Against Replace: more complicated work with wire format of RDATA (e.g. when signing).
Regarding moving signatures to node - be extra careful, they cannot be merged into one RRSet. But since RRSets in node are now kept in a simple array, it should be OK to just insert all RRSIG RRSets one after another, or even better - each after the RRSet it signs. (Note that if I write "RRSIG RRSet", I mean set of RRSIG RRs covering one RRSet.)
- Author Contributor
It is indeed not a requirement for refcounting, but I care not for the speed improvements, but for the code complexity reduction. For me, it's easier to refactor something smaller rather than large and now it's not clear enough where I work with the rrset itself and/or rdata.
- Replace: hard to say, managing indices also require extra code and space. If only the indices would be reordered, memcmp itself wouldn't be sufficient for comparison (because the data array would be still unsorted). The cost of replacing it would be extra loop when we want to do something with the rdata as a whole, like - compare, sign and I can't think of what else.
With indices, it could be compared like:
memcmp(r1, r2, length)
. With inline item sizes:while(item_size(r1) > 0) { memcmp(r1, r2, item_size(r1)); r1 = next.., r2 = next.. }
or so. Truth is, I am used to this from dnames so for me it is consistent, but I also like benefits of indices for sorting. - Author Contributor
Well, OK, i still counted with ordered RDATA instead of indices. I thought the goal is to simplify the usage as much as possible, even at the cost of speed (ordering the RDATA).
- Author Contributor
If it's ordering by indices vs. inline item sizes, then the latter is probably better: usage is equally complicated and it saves some space.
- Author Contributor
I have updated the issue with suggestions for changes and steps, lots of stuff is up to implementor or will clear up when you start the works. Take them with a grain of salt and provide your reasoning in the issues created for each part.
- Author Contributor
Some comments:
-
All the iterations are very similar, only specialty is compressing/deflating names - Agreed, but how to fix this? Some might be removed or merged, or we could add an apply function with three function pointers - one handling fixed size items, one for compressed dnames, one for remainder and I think we still need a special case for NAPTR, sadly (badly engineered RFC, if you ask me).
-
Also split the long functions - agreed
-
Compression is wonky with 2-3 different structures - Well, I took that code as-was from response.c, I know there were some changes to compression API, I'll look into that.
-
Writing to journal (aka binary) and to wire should be the same .... - yep
-
Reordering should be done depending on how the RDATA is stored, either indices only or RDATA - I'd rather not change this, sort both RDATA and indices. Maybe we could use lengths instead of offsets (but still separated, else we lose all the benefits connected to parsing and so on), we'd save some space this way, but access to random RRs would be more difficult and slower.
-
Also replace
xmalloc
like functions when you see them ... okay, makes sense I guess -
This is more a question: replace direct accessors with to extra value with direct access ... - I mostly agree, but we'd need a rule saying that there either is an API and then you have to use it, or there isn't and then you do direct access. Otherwise we might end up in situation when you set(get) a value, but miss some needed operation.
-
Very important - clear questions and code review remarks from comments. - I, for one, don't mind questions in comments, if they really make sense, but okay.
-
Even static functions which are not oneliners ... Well, when a function is called
knot_rrset_add_rr_sort_n
it's obvious it's not for fetching bananas, but fine :trollface: -
Split somehow functions that manipulate with rrset and functions that interpret/process the rdata. - Agreed, maybe one file for one type would be best.
-
No 1/0 parameters like ddns_check in knot_rrset_remove_rr_using_rrset - Agreed, but maybe for one on/off switch a bool argument is sufficient, when you need more, use flags?
-
Ordering of the functions should be the same as in header and should be sensible -fine
Good points overall, thanks
-
- Author Contributor
Processing - One callback with parameters
{void* context, unsigned item_type, uint8_t* src, unsigned length}
would be fine. The context could be a specific type carrying things like pointer to wire, wire pos, compression table and so on. Basically all the callback needs to do is either writing the rdata item somewhere (packet/journal->rrset or vice versa) or checking it. Any RDATA item (fixed, remainder, dname, whatever) should be passed to this callback along with the type. Or instead of callback, it could accept a structure with pointers to functions and call function for compressed names/rest accordingly. Just give it a thought, I'm sure you can work something out when you write it.Reordering - hmm, I think parser gives you each RDATA item separately no? Then you wouldn't lose any benefit by storing lengths inline. I thought about it and the compare could still be done by one single memcmp. Presuming the RDATA items are sorted and each is prepended by lengths, then the records are equal if the lengths match and the data match. It won't be canonical comparison like with dnames, but I don't think that is required. Also how many items are in RDATA on average? Tens, hundreds? Then it would be slower to for example repetitively access the 101th item. And how often do you need random access? My experience from the dnames and tries is that jumping through max 127 labels is very very fast and hardly ever a bottleneck. I obviously think the benefits of inline sizes are bigger IF you don't use the indices for sorting, although I will accept any other solution if it would be more readable/considerably faster/less code. So before you decide, please think about the numbers first or even try and measure both approaches.
Direct access - fine, I propose we write it to the coding style. If the structure is public, you can access it's members directly. If the implementor doesn't want you to do that, hide the implementation-specific stuff using opaque pointers, prefix the member with underscore or write a comment. Good rule of thumb is looking at you usage pattern, I tend to choose the easiest way. If
rrset->data
is available, I'll choose it every time instead of boring myself to death withknot_rrset_get_rdata(rrset)
(this is hypothetical situation). :hurtrealbad:Questions in comments - fine if they are helpful questions, not fine if they are questions like
[code review] What does it do?
orI don't understand this.
Function fetching bananas - I know, and I am not asking you to write a paragraph and seal it with ASCII art diagrams. I ask for a single line. You are biased because you wrote it, but I see that it adds something, probably a RR to sorted set or it adds a RR and sorts the RRSet? And why is there n? I get the meaning eventually, but
one descriptive line would save me from all this.1/0 parameters - enum/flag is always better unless it's trivial. Honestly, most of the time something like this is even needed (exceptions) I immediately think that there is something fishy with the overall design and the exception is a bailout. It may not be true, just be careful with it.
- Author Contributor
Reordering - Now it does, but it should be much faster when just one block is given to function creating the zone. Comparison with inline lenghts would not be an issue, probably. Honestly, random access is not an issue either, it's needed scarcely. And now that I think about it, storing legths inline actually makes sense, but only for whole RRs not the actual items within one RR. This way, it would not break anything, since you never actually work with whole RRSet when, say, sending a message, you only work with RRs, and usually you iterate through all of them. It might be worth it after all, but iff we do not break the actual wireformat, let's see what @lslovak has got to say about it, but it could save us some memory (specifically, one pointer to the array, and
uint16_t
for each RR, since we'd use lengths and not offsets) and we could get rid of code handling indices, which is quite complicated sometimes. But in order for this to work fine, we'd need to change workflow with RRs a bit, especially all the RDATA get functions would not make sense anymore. Anyway, I think that the solution with indices is quite elegant, but if we end up with less code and less memory used, who I am to oppose this.Direct access - I'd use underscores, but I think those are banned in our original coding style. But I agree that using getters is quite dull sometimes, especially when you chain them.
Banana function - I agree that some function names are not quite descriptive.
I agree with the rest.
- Ghost User mentioned in issue #157 (closed)
mentioned in issue #157 (closed)
- Ghost User mentioned in issue #83 (closed)
mentioned in issue #83 (closed)
- Ghost User Status changed to closed
Status changed to closed
- Jan Včelák mentioned in issue #445 (closed)
mentioned in issue #445 (closed)
- Jan Včelák mentioned in merge request !507 (merged)
mentioned in merge request !507 (merged)
- Daniel Salzman mentioned in commit 3a213378
mentioned in commit 3a213378
- Daniel Salzman removed milestone
removed milestone
- Daniel Salzman mentioned in commit 93f3f0b4
mentioned in commit 93f3f0b4
- Daniel Salzman mentioned in commit 77bd12d4
mentioned in commit 77bd12d4
- Daniel Salzman mentioned in commit 0fc6bbd1
mentioned in commit 0fc6bbd1
- Daniel Salzman mentioned in commit d644e00a
mentioned in commit d644e00a
- Daniel Salzman mentioned in commit 40f6c0f5
mentioned in commit 40f6c0f5
- Daniel Salzman mentioned in commit af3ec1d9
mentioned in commit af3ec1d9
- Daniel Salzman mentioned in commit f85afb81
mentioned in commit f85afb81
- David Vasek mentioned in commit 69b17ed1
mentioned in commit 69b17ed1
- David Vasek mentioned in commit 58476f67
mentioned in commit 58476f67
- David Vasek mentioned in commit e60e5f5d
mentioned in commit e60e5f5d
- David Vasek mentioned in commit 3f90e444
mentioned in commit 3f90e444
- David Vasek mentioned in commit 9a76f4b6
mentioned in commit 9a76f4b6
- David Vasek mentioned in commit 80ddb29c
mentioned in commit 80ddb29c
- David Vasek mentioned in commit 612479e9
mentioned in commit 612479e9
- David Vasek mentioned in commit 2061615b
mentioned in commit 2061615b
- David Vasek mentioned in commit c0d78899
mentioned in commit c0d78899
- David Vasek mentioned in commit 7f8e1c9b
mentioned in commit 7f8e1c9b
- David Vasek mentioned in commit 08c38698
mentioned in commit 08c38698