Jan Včelák · 29976dc6
--- a/design-new-zone-api.md
+++ b/design-new-zone-api.md
+### New zone API
+We need a new zone API. Here's what's wrong with the current one :sweat: :
+- Consistency and transactions are handled by the user, not by the zone structure itself. The code is flooded with `rcu_` calls, which is, needless to say, very error-prone.
+- Changes in the zone (a.k.a. `knot_changesets_t`) are handled by the user as well, which leads to lots of duplicated code.
+- The user has to take care of the journal as well; the cons are the same as above.
+So, the goals are as follows :facepunch: :
+- The zone database returns a pointer to the zone structure, and the user can do whatever they want with it (i.e. read, write, delete). As long as the user does not return the zone to the zone database, they are guaranteed that nothing will change in the zone. There might be two types of `getters`: read and write.
+- No direct calls to `rcu_`, possibly drop RCU altogether.
+- Changesets are created on the fly, or by zone-diff in the first implementations. The changeset structure itself has to be changed to store sorted changes, because of the DNSSEC chain fix.
+- Custom allocators.
+- ~~Event q for each zone. [More here](zone-events-serialization)~~
+- Journal is part of the zone structure and is not visible to the user.
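+To make the getter idea concrete, here is a minimal single-threaded sketch of read and write getters. Every name in it (`zone_db_t`, `zone_db_get_read()`, `zone_db_get_write()` and so on) is hypothetical and not taken from the code base, the zone contents are reduced to a serial number, and the journal to a counter. Synchronization is deliberately left out, so this only shows the intended interface and the "nothing changes until you return the zone" guarantee, not a thread-safe implementation.
```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical names throughout; nothing here is taken from Knot DNS. */
typedef struct {
    int serial; /* stand-in for the real zone contents */
    int refs;   /* handles (plus the database) pointing at this version */
} zone_version_t;

typedef struct {
    zone_version_t *current; /* version handed out to new readers */
    int journal_entries;     /* stand-in for the journal, never touched by users */
} zone_db_t;

static void version_unref(zone_version_t *v)
{
    if (--v->refs == 0) {
        free(v);
    }
}

int zone_db_init(zone_db_t *db, int serial)
{
    db->current = malloc(sizeof(*db->current));
    if (db->current == NULL) {
        return -1;
    }
    db->current->serial = serial;
    db->current->refs = 1; /* the database's own reference */
    db->journal_entries = 0;
    return 0;
}

void zone_db_deinit(zone_db_t *db)
{
    version_unref(db->current);
    db->current = NULL;
}

/* Read getter: pins the current version until zone_db_put_read(). */
const zone_version_t *zone_db_get_read(zone_db_t *db)
{
    db->current->refs++;
    return db->current;
}

void zone_db_put_read(const zone_version_t *v)
{
    version_unref((zone_version_t *)v);
}

/* Write getter: hands out a private copy that no reader can see yet. */
zone_version_t *zone_db_get_write(zone_db_t *db)
{
    zone_version_t *copy = malloc(sizeof(*copy));
    if (copy != NULL) {
        copy->serial = db->current->serial;
        copy->refs = 1; /* becomes the database's reference on publish */
    }
    return copy;
}

/* Returning the write handle journals the change and publishes it. */
void zone_db_put_write(zone_db_t *db, zone_version_t *changed)
{
    zone_version_t *old = db->current;
    db->journal_entries++;
    db->current = changed;
    version_unref(old); /* freed only after the last reader returns it */
}

/* Walk-through: returns 0 if a pinned reader is unaffected by a write. */
int zone_db_demo(void)
{
    zone_db_t db;
    if (zone_db_init(&db, 1) != 0) {
        return -1;
    }
    const zone_version_t *r = zone_db_get_read(&db);
    zone_version_t *w = zone_db_get_write(&db);
    if (w == NULL) {
        zone_db_put_read(r);
        zone_db_deinit(&db);
        return -1;
    }
    w->serial = 2;
    zone_db_put_write(&db, w);
    int old_seen = r->serial; /* still the version the reader pinned */
    zone_db_put_read(r);
    const zone_version_t *r2 = zone_db_get_read(&db);
    int new_seen = r2->serial;
    zone_db_put_read(r2);
    int journal = db.journal_entries;
    zone_db_deinit(&db);
    return (old_seen == 1 && new_seen == 2 && journal == 1) ? 0 : 1;
}
```
+Note that the user never touches the journal or any `rcu_` call in this sketch; both would live behind the put/get functions. Whether the write getter copies the whole zone, as it does here, or does something cheaper is exactly the open question in the notes below.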
+Misc notes :musical_note: :
+- We could either maintain two versions of zone, new and old, but that would mean that only one thread can change the zone, **or**:
+- ~~We could maintain many versions of zones, it would be harder to code, but not that much harder I think - but do we have a scenario when two concurrent threads need to write into zone? - probably not~~ (no, we don't want that @mvavrusa)
+- The fastest way to achieve this (in terms of development time, not execution time) is to create a full copy when a writable zone is requested, then wait for readers before switching, much like it's done now. This would obviously cost a lot of memory, but if we used a memory pool for each zone it would be okay once the operation is done.
+- The better way would be to code a copy-on-write approach, but that would require some atomic operations in the trie, and that might be hard to achieve (i.e. insertions and retrievals would need to be atomic, probably by using (spin)locks - bad? Global zone locks for the whole time of the operation are out of the question IMHO, unless we're talking about very, very small changes)
+  - @mvavrusa: I think it is reasonable to have a small lock for zone changes because:
+     - UPDATE is limited to 1 packet (=> should be fast)
+     - IXFR is USUALLY smaller (think jitter) and sometimes LARGER (re-sign), but the frequency of large transfers is quite low.
+     - AXFR should be loaded in the background and replace the old zone, I think.
+  - Those operations would be faster if we didn't copy the zone in the first place. IF the operations complete within 5s or so, it would only show as a dip in performance once in a while. BUT (or should I say BUTT), the locks must be fast and each answering thread should have its own so that they don't wait for each other (`pthread_rwlock_rdlock()`?)
+  - IF it is guaranteed that only one thread can modify the zone, then we could modify the zone without the write lock, raise the write lock, commit the changes to the zone, fixup things like prev-next, broken NSEC(3) chains and so on, drop the write lock.
+      - @jkadlec: That would work, and it might be easier to code than what's below, but lookups would have to check the zone and changes at the same time. For node addition/removal we still need atomic trie add / delete, but we could store stuff every once in a while with global lock, rest would be left for answering.
+      - @mvavrusa: I don't think we need atomic add/delete. The zone should change from only one thread at a time and the lock must be held for the whole operation, so the zone contents stay consistent during answering.
+      - @jkadlec: We've discussed this, and basically, we need some measurements as to what operations take most of the time after transfers/whatever. If we could reduce the lock time to something reasonable, say 1-3 seconds for the biggest zones, then atomic add/delete is not needed (it would mean a ton of preprocessing so that only the necessary stuff is done during the lock). If it takes more than that, then yes, we'll need it.
+  - So the critical section would be basically current adjusting + a bit of stuff. Or we could raise the lock earlier and apply changes transparently to the zone, but good luck with rollback if something fails (+ multipacket zone transfers would hold the lock for the entire time?)
+      - @jkadlec: I think that rollback would be virtually the same for this approach and the one above. This approach would mean easier and faster lookups (when the node is not being updated, of course).
+Status update 1:
+* DDNS now uses zone_update_t.
+  * It works, it is usable, it cleans up the code.
+  * The insides are still ugly: it directly uses the changeset contained in the zone_update_t structure. To fix that, we need to improve the changeset.
+* We decided to scrap JK's DNSSEC chain update, as it brings no obvious advantages (it cleans up the zone_update_commit() function but complicates the code elsewhere).
+* The current implementation (incremental, changeset-based zone_update only):
+  * Gathers RRs into a changeset, then on commit: copies the zone contents, applies the changeset to the copy, creates a second changeset (only when DNSSEC is enabled), generates the NSEC(3) chain changes into it, applies the second changeset, merges the two changesets, saves the changes to the journal, unlocks rcu, switches the zone contents, synchronizes rcu, locks rcu, and frees the original zone contents.
+  * The previously discussed (see above) idea to do copy-on-write is currently being ignored for simplicity. DDNS was already done like this anyway.
+* Next steps:
+  * Improve the changeset, clean up DDNS internals.
+  * Convert IXFR to zone_update_t. Should be similar to DDNS.
+  * Add FULL updates to zone_update, convert AXFR to it.
+  * Convert initial zone loading to use FULL zone_update.
+  * Other than these, we shall see. JK's original zone-api branch contains many bugs and unfinished parts. I also have my doubts about the usefulness of some of them.
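+For readers new to the code, the gather-then-commit flow of the incremental update described above can be sketched roughly like this. It is a simplified model, not the real zone_update_t: all types and functions here are hypothetical, RRs are integers, the journal is a counter, and the DNSSEC and RCU steps are reduced to comments.
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_RR 16

/* Hypothetical, simplified types; not the real zone_update_t internals. */
typedef struct { int rr[MAX_RR]; int count; } contents_t;
typedef struct { int add[MAX_RR], rem[MAX_RR]; int add_count, rem_count; } changeset_t;
typedef struct { contents_t *contents; int journal_len; } zone_t;
typedef struct { zone_t *zone; changeset_t change; } update_t;

static int push(int *arr, int *count, int rr)
{
    if (*count == MAX_RR) {
        return -1;
    }
    arr[(*count)++] = rr;
    return 0;
}

static int find_rr(const contents_t *c, int rr)
{
    for (int i = 0; i < c->count; i++) {
        if (c->rr[i] == rr) {
            return i;
        }
    }
    return -1;
}

void update_init(update_t *up, zone_t *zone)
{
    memset(up, 0, sizeof(*up));
    up->zone = zone;
}

/* Gathering phase: RRs only go into the changeset, the zone is untouched. */
int update_add(update_t *up, int rr)
{
    return push(up->change.add, &up->change.add_count, rr);
}

int update_remove(update_t *up, int rr)
{
    return push(up->change.rem, &up->change.rem_count, rr);
}

/* Applies removals, then additions; fails on a missing RR or overflow. */
static int changeset_apply(contents_t *c, const changeset_t *ch)
{
    for (int i = 0; i < ch->rem_count; i++) {
        int pos = find_rr(c, ch->rem[i]);
        if (pos < 0) {
            return -1;
        }
        c->rr[pos] = c->rr[--c->count]; /* order is irrelevant in this sketch */
    }
    for (int i = 0; i < ch->add_count; i++) {
        if (push(c->rr, &c->count, ch->add[i]) != 0) {
            return -1;
        }
    }
    return 0;
}

/* Commit: copy, apply, journal, switch, free the old contents. */
int update_commit(update_t *up)
{
    contents_t *copy = malloc(sizeof(*copy));
    if (copy == NULL) {
        return -1;
    }
    *copy = *up->zone->contents;
    if (changeset_apply(copy, &up->change) != 0) {
        free(copy); /* the published contents were never modified */
        return -1;
    }
    /* With DNSSEC, a second changeset holding the NSEC(3) chain changes
     * would be generated, applied to the copy and merged in here. */
    up->zone->journal_len++; /* stand-in for saving the changes */
    contents_t *old = up->zone->contents;
    up->zone->contents = copy;
    /* Real code must wait for readers (an RCU grace period) before this. */
    free(old);
    return 0;
}

/* Walk-through: one successful update, one rejected update. */
int update_demo(void)
{
    zone_t zone = { .contents = calloc(1, sizeof(contents_t)), .journal_len = 0 };
    if (zone.contents == NULL) {
        return -1;
    }
    zone.contents->rr[0] = 1;
    zone.contents->rr[1] = 2;
    zone.contents->count = 2;
    update_t up;
    update_init(&up, &zone);
    update_remove(&up, 1);
    update_add(&up, 3);
    int ok = update_commit(&up) == 0 && zone.contents->count == 2 &&
             find_rr(zone.contents, 1) < 0 && find_rr(zone.contents, 3) >= 0;
    update_init(&up, &zone);
    update_remove(&up, 9); /* not in the zone, so the commit must fail */
    int rejected = update_commit(&up) != 0 && zone.journal_len == 1;
    free(zone.contents);
    return (ok && rejected) ? 0 : 1;
}
```
+Because the changeset is applied to a copy, a failed commit leaves the published contents alone and rollback amounts to freeing the copy. That simplicity is paid for with a full copy of the zone on every commit, which is the cost the copy-on-write discussion above was trying to avoid.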
+**Feel free to add whatever you want, and/or edit what's already here.**