If you're not on the routing-wg mailing list, there's something you
should know:
-------- Forwarded Message --------
Subject: [routing-wg] RPKI Outage Post-Mortem
Date: Tue, 25 Feb 2020 15:12:15 +0100
From: Nathalie Trenaman <nathalie(a)ripe.net>
To: routing-wg(a)ripe.net
Dear colleagues,
From Saturday 22 February at 08:24 (CET), any newly created, modified,
or deleted ROAs (176 in total) could not be added to our publication
server due to a disk problem. From that moment on, all the data was
stored in the database, but it was not published. The disk did
not report any problems and, therefore, no engineer was alerted to this
incident.
Due to the disk problem, starting from Sunday 23 February at 09:10
(CET), our CRL expired and our repository could not be properly updated.
This was reported to us on Monday 24 February at 11:44 (CET).
Our engineers fixed the disk problem immediately. However, because the
CRL had expired, all underlying objects had expired as well. Depending on
the Relying Party software an operator used, this abnormal behaviour
manifested differently.
Initially, our engineers tried to do a full re-population of the RPKI
repository, but unfortunately, this did not update the CRL in the
validation tree. At 15:03 (CET), we performed a full CA key-roll, which
was completed at 21:02 (CET) and resolved the problem. At 19:58 (CET),
all objects in the backlog were published.
We apologise for any inconvenience this may have caused and we are
taking all the necessary steps to ensure this incident does not happen
again in the future.
Kind regards,
Nathalie Trenaman
Routing Security Programme Manager
RIPE NCC