Apply jitter to signature lifetime & improve the signature renew interval

mentioned in issue #155 (closed)

Enabling signature jitter with no further changes would basically mean that the server would be resigning the zone throughout the jitter interval. That's okay, but resigning currently means generating all the signatures from scratch, which means 100% server load (or signing thread load) for the whole jitter period.

We could:

modify the signature checking algorithm for planned resign, so that it only checks for expiration. That would reduce server load a lot. But all other checks would still need to create the signature (reload, DDNS: RRs can change and the RRSIGs would be bogus). We should probably do this either way; I've created issue #170 (closed)
~~extend the expiration interval so that all the 'jittered' signatures would get resigned at one signing run~~

I think the jitter should not be completely random. We could rather spread the signing over the renewal interval uniformly and do the signing in batches.

Let's say there are 1000 records in the zone to be signed, and the signatures for these records should be renewed within an interval A - B. For instance, with 5 signing batches, 200 signatures will be renewed in every batch.

Here is an illustration of what I was trying to explain:

A * * * * * * * * * * * * * * * * * * * * * * * * B
|         |         |         |         |
1         2         3         4         5
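
The uniform spreading illustrated above can be sketched roughly as follows. This is only an illustration under assumptions: `batch_of_record` and `batch_time` are hypothetical helper names, records are assumed to be numbered from 0, and the batches are assumed to start at A and be evenly spaced, with the last one ending before B.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 0-based batch that record number `index` falls into
 * when `n_records` records are split into `n_batches` nearly equal batches. */
static size_t batch_of_record(size_t index, size_t n_records, size_t n_batches)
{
    assert(n_batches > 0 && index < n_records);
    return index * n_batches / n_records;
}

/* Hypothetical helper: start time of a batch when the batches are spread
 * uniformly over the interval [a, b), the first one starting at a. */
static uint64_t batch_time(uint64_t a, uint64_t b, size_t batch, size_t n_batches)
{
    assert(n_batches > 0 && batch < n_batches && a <= b);
    return a + (b - a) * batch / n_batches;
}
```

With 1000 records and 5 batches, records 0 to 199 land in the first batch, records 200 to 399 in the second, and so on.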

That's exactly what we agreed upon yesterday with @jkadlec

Blocked by #170 (closed), postponing.

The signature renewal interval should be much longer, because any problem in the signing can render the zone invalid immediately. There has to be some safety margin, which will give the zone operator extra time to fix problems. (OpenDNSSEC uses 3 days as a default renewal interval.)

Finally, I looked into this and it's not as simple as we first thought. Small recap: the goal is to resign the zone records not all at once, but also not to resign each expired record separately. The main problem is with dynamic updates. They create signing batches scheduled for any time (time of the update + signature lifetime - safety margin).

Currently, each time a regular resign is executed, most of the signatures that expire between that time and the time of the next regular resign are resigned. (The time is computed somewhat weirdly and does not span the whole interval, but that doesn't matter now.) I.e. updates are not a problem, because they are handled together with other records in a resign. (In other words: resign times are unified and even after a long run the server won't be running a regular (scheduled) resign more than once in a signature lifetime interval.)

However, if we introduce any form of jittered signature lifetimes (I think there are two basic approaches) and divide the signing into smaller intervals, we lose this "uniting" of the updates' signing batches. In the better case it may happen (and likely will, after running long enough) that the server will resign a (small) part of the zone every B seconds, where B is the time between two signing batches as mentioned above. In the worse case the resigning would happen as often as the updates come.

The first question then is: do we want this? If dynamic updates are enabled, the server would be resigning a lot of times, though every resign may affect only a very small portion of the zone. Currently, however, the resigning is not very optimal: all signatures that do not expire in the current interval - and that would be most of the signatures in case of the "jittered" approach - are verified, which is quite heavy. That brings us back to the issue #170 (closed).

There is also an option to choose a different strategy depending on whether dynamic updates are on or off, or to think of some workaround for the updates.

Thoughts?

I think the easiest solution is to resign the dynamically added records in the last regular signing batch prior to the expected resign period of the new RRSIG. As a result, you will have the same number of signing batches all the time.

This is what I mean:

t: * - - - - * - - - - * - - - -
   1     a   2     b   3

An asterisk is an execution of a signing batch. If a new RR is added at the moment a, a new RRSIG will be added into the zone. This RR should theoretically be resigned at time b. So I propose that this record slips into the batch triggered at time 2.
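
A minimal sketch of that rule, assuming the regular batches run at a fixed period starting from some known time. The name `last_batch_before` and its parameters are hypothetical, not existing server code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: time of the last regular batch that runs at or before
 * the `expected` resign time of a newly added RRSIG. Regular batches are
 * assumed to run at first_batch, first_batch + period, and so on. */
static uint64_t last_batch_before(uint64_t first_batch, uint64_t period,
                                  uint64_t expected)
{
    assert(period > 0 && expected >= first_batch);
    /* Integer division rounds down to a whole number of periods. */
    return first_batch + (expected - first_batch) / period * period;
}
```

In the picture, a record whose resign would fall at time b (between batches 2 and 3) is moved to batch 2, so no extra batch is ever scheduled.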

On the other hand, the number of RR types in different signing batches can get unbalanced. I'm not sure how to fix that.

BTW, this applies to DDNS and manual zone updates as well.

Well, it looks nice in the picture, but the situation is a bit different. The signature lifetime (L) is several times larger than the size of the jitter interval and thus also than the intervals between the signing batches. So in real life it would look more like this:

 *----------.-------------------(|--|--|--|--|)-----.------------------------(|--|--|--|--|)----
t_0         u                         t_1           u'

The whole zone is signed at time t_0 and should be resigned at t_1. The resign is divided into several batches (|) around the expected resign time (time between two batches = batch interval, B), throughout the jitter interval (marked by parentheses). An update comes at time u and it should be resigned at time u'.

There is no nice way to set the signature lifetime of the record added at u to the value of the last signing batch or anywhere in the jitter interval (Or is there? Maybe the nearest scheduled resign? And can we do this if the update comes very shortly before the scheduled resign, thus limiting the signature lifetime to a very short period?).

But we could probably detect that the batch currently being resigned is the last one in the current jitter interval: that is the case if the nearest expiration is scheduled later than one batch interval (B) from now. We could then resign all remaining records that expire before the start of the next jitter interval. But this would not work if u' is less than B after the last signing batch. In that case the resigning of the updated records stays scheduled, and any remaining records up to the next jitter interval get resigned then (at time u'). This situation may repeat, resulting in the behaviour described above.

Or maybe I'm missing something and the pitfall may be avoided...

OK. We are talking about different approaches and I think we should settle on one before doing anything else.

The questions are:

How to implement the jitter?
How to handle first zone signing?
How to deal with new RRs in the zone?

Let's talk about the first question first.

First of all, I think that the jitter need not be shorter than the signature life time in general. The purpose of the jitter is to relieve the master from excessive load when the signatures are expiring. I think that interval between batches can be in minutes or days, and the result will be the same. So there are at least two options:

Regular periodical refresh

We can split all RR types in the zone into batches (A, B, C in the example). And sign them batch by batch, having identical period between subsequent batches:

A1 - - - - B1 - - - - C1 - - - - A2 - - - - B2 - - - - C2 - - - - A3 ...

If we stick with the current behavior, when signatures are refreshed after 0.9 * RRSIG_lifetime, then this must be the interval between A1 and A2.
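
As a rough sketch of the arithmetic: if a full cycle (A1 to A2) takes 0.9 * RRSIG_lifetime and the batches are evenly spaced, the spacing between subsequent batches follows directly. The helper name `batch_period` is hypothetical, and the even spacing is an assumption taken from the diagram.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: seconds between two subsequent batches (e.g. A1 and
 * B1) when a full cycle over all batches takes 0.9 * rrsig_lifetime. */
static uint32_t batch_period(uint32_t rrsig_lifetime, unsigned n_batches)
{
    assert(n_batches > 0);
    /* 64-bit intermediate avoids overflow for very long lifetimes. */
    uint64_t refresh = (uint64_t)rrsig_lifetime * 9 / 10;
    return (uint32_t)(refresh / n_batches);
}
```

For example, a 30-day lifetime (2592000 seconds) with three batches gives a batch every 777600 seconds, i.e. every 9 days.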

Multiple refreshes around the same time

Use short jitter and trigger multiple refreshes around the same time:

(A1-B1-C1) - - - - - - - - (A2-B2-C2) - - - - - - - - (A3-B3-C3) ...

Again, the interval between (A1-B1-C1) and (A2-B2-C2) is 0.9 * RRSIG_lifetime. The interval between A1 and B1 is short, but not specified further.

Right, I was talking about the second approach. I didn't consider the first one, because it would require setting signature lifetimes to values too different from the configured one. I suppose we should respect the configured value if there is one.

You are right, though, that in the first approach it would be quite straightforward to merge the possible updates into one of the signing batches. Of course, it might become unbalanced. We can either ignore that threat (and assume that the updates are random enough that they get distributed more-or-less equally) or try to come up with some balancing solution. (Just a first thought: when resigning, we can count the number of resigned records and all records in the zone. If the ratio gets too bad, we can schedule forced resign instead of normal for the next period. That will create the batches from scratch.)

The second approach seems better in terms of respecting the signature lifetimes, but whichever way we implement it (I mentioned two approaches, but didn't explain them, it would be too long ;-), we may, in the long run, end up signing every (B1 - A1) minutes. Maybe we can, in that case, also run the forced resign once in a while and set up the batches from scratch (but how to detect such a state?).

As for the jitter interval size: well, in general it can be anything, though it's not a "jitter" anymore, technically speaking. As I said, it's a question of respecting the configured signature lifetimes. I think we should respect them, mostly because the user may want to achieve some particular behaviour or setup. Also, isn't it important when doing a key or algorithm rollover? During a rollover, one should wait until the records expire from the caches. I think some considerations mentioned in RFC 4641, Section 4.1.1 about the TTLs may apply for this case. (User wants to setup the signature lifetime according to the TTL or vice versa).

Right, the first approach is not a jitter. But it is a solution to the same problem. And I think its main advantage is better predictability of the signature refresh times.

As for the signature life-time value. I think we can safely use a lower value than the configured one. In addition, this will happen only for the initially added signatures. The refresh will always set the exact life-time as configured.

Another visualization, this time with the initial signing and one dynamic update:

               A0_exp    B0_exp    C0_exp    A1_exp    B1_exp    C1_exp
               |         |         |         |         |         |
+---------+---------+---------+---------+---------+---------+---------+-- ...
|         |     |   |         |    |    |         |         |         |
init      A1    |   B1        C1   |    A2        B2        C2        A3
                |                  |
                u(C)               u(C)_exp

In this case I assume that a signature An will expire somewhere between A(n+1) and B(n+1). The init is the initial signing, which leads to shorter life-time signatures for batches A and B; the life-time for C will be the same as configured.
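
One way to read the initial signing in the picture: the last batch gets the configured life-time, and every earlier batch gets one batch period less than the batch after it. A minimal sketch under that assumption follows; `initial_lifetime` is a hypothetical name, and the numbers in the usage note are read off the picture, not taken from any configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: life-time used at the initial signing for signatures
 * placed into 0-based batch `batch`. The last batch keeps the configured
 * life-time; each earlier batch is shortened by one more batch period. */
static uint32_t initial_lifetime(uint32_t lifetime, uint32_t period,
                                 unsigned batch, unsigned n_batches)
{
    assert(batch < n_batches);
    uint32_t cut = (n_batches - 1 - batch) * period;
    assert(cut < lifetime);
    return lifetime - cut;
}
```

Taking one column of the picture as 10 time units, the configured life-time is 35 and the batch period is 10, which gives initial life-times of 15, 25 and 35 for batches A, B and C.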

A dynamic update u was arbitrarily added into batch C and will be refreshed in C1. Its life-time is shorter as well, and it will be prolonged to the configured value during the refresh.

As for the new records, these can be distributed into the batches on a round-robin basis or using a uniform random generator, so the batches will always be balanced. And the batch can be easily identified based on its RRSIG expiration value.

There will be no difference in doing algorithm rollover. The rollover intervals are determined by the maximal TTL in the zone, not the signatures' life-time. Do not consider rollovers at this point, otherwise this will get much more complicated...

removed assignee

changed milestone to %backburner

My own thoughts: we could simplify this to replacing

rrsig_expire = ctx->now + rrsig_lifetime;

with

rrsig_expire = ctx->now + 2*(time(NULL) - ctx->now) + rrsig_lifetime;

possibly also rounding the contents of ( ... ) down to multiples of ten seconds.

This would make the first signing (or resigning after DNSKEY change) unjittered, but the following RRSIG refreshes would be split into batches signable in 10 seconds. In a very natural way.
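
The proposed computation, including the optional rounding, could be sketched as a pure function. This is only an illustration: `jittered_expire` is a hypothetical name, `start` stands for `ctx->now` and `current` for the value of `time(NULL)` at the moment the RRSIG is created.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring the proposed formula:
 *   expire = start + 2 * (current - start) + lifetime
 * with the elapsed time rounded down to a multiple of ten seconds. */
static uint64_t jittered_expire(uint64_t start, uint64_t current,
                                uint32_t lifetime)
{
    assert(current >= start);
    uint64_t elapsed = current - start;
    elapsed -= elapsed % 10;   /* round down to multiples of ten seconds */
    return start + 2 * elapsed + lifetime;
}
```

Signatures created within the same ten-second window of a signing run get the same expiration and would therefore be refreshed together later.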

Anyway, this also has some disadvantages. The performance disadvantage: first let's look at how RRSIGs are processed during their refresh. The zone is iterated, each RRSIG is checked first for time validity (a simple comparison of timestamps), and if valid, it is cryptographically checked for validity (expensive). If invalid, a new RRSIG is created (more expensive). Now if the RRSIGs all expire at one moment, the expensive check for validity is performed for none of them. However, if they expire in batches, each time a batch is re-signed, the other RRSIGs are all checked for validity! This also relates to #170 (closed), which seems unsolvable.

Also, we don't know if users want this feature. They need to scale their power for the most demanding scenario anyway. We don't like to introduce a new setting for this in the configuration, because nobody would turn it on. Making it the default seems to be rather dangerous.
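
To make the cost argument concrete, here is a simplified model of one refresh pass that only counts the work. It is a sketch, not the server's code: `refresh_pass` and `struct refresh_cost` are hypothetical, and it ignores the case where a cryptographic check fails and forces a re-sign.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tally of the work done in one refresh pass. */
struct refresh_cost {
    size_t verified;   /* expensive: cryptographic validity check */
    size_t resigned;   /* more expensive: new RRSIG created */
};

/* Simplified model: an RRSIG expiring at or before `refresh_before` fails the
 * cheap timestamp check and is re-signed; every other RRSIG passes it and is
 * then cryptographically verified. */
static struct refresh_cost refresh_pass(const uint64_t *expire, size_t count,
                                        uint64_t refresh_before)
{
    assert(expire != NULL || count == 0);
    struct refresh_cost cost = { 0, 0 };
    for (size_t i = 0; i < count; i++) {
        if (expire[i] <= refresh_before) {
            cost.resigned++;
        } else {
            cost.verified++;
        }
    }
    return cost;
}
```

With four RRSIGs that all expire at the same time, a pass re-signs four and verifies none. If they expire in four separate batches, each pass re-signs one and verifies the other three.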

Overall, there are more problems than benefits.

closed
