Incident Description
Root Cause
Thales performed maintenance on the Cloud HSM infrastructure over the weekend, which left the SecureAuth Certificate Authority (CA) systems unable to connect to the Cloud HSM for key validation.
The SecureAuth CAs regularly renew the Certificate Revocation Lists (CRLs) for multiple CAs. The delta CRLs have a validity period of approximately 48 hours, which is why the Thales maintenance had no impact until Sunday evening, and customers were not impacted until Monday morning.
Compounding the problem, the alerts generated by the monitoring systems were not routed to the location the L1 team monitors.
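The delay between the failed renewals and the customer impact follows from the roughly 48-hour delta CRL validity described above. A minimal sketch of a freshness check that would flag a CRL before it lapses is shown below. The function name, the 24-hour warning margin, and the example timestamp are illustrative assumptions, not part of the SecureAuth monitoring configuration.

```python
from datetime import datetime, timedelta, timezone


def crl_freshness(next_update, now, warn_margin=timedelta(hours=24)):
    """Classify a CRL by how much time remains before its nextUpdate.

    next_update and now must both be timezone-aware datetimes.
    warn_margin is a hypothetical threshold; tune it to the renewal schedule.
    """
    remaining = next_update - now
    if remaining <= timedelta(0):
        return "expired"
    if remaining <= warn_margin:
        return "expiring"
    return "ok"


# Hypothetical timeline: the last delta CRL is issued just before the HSM
# becomes unreachable, and no renewal succeeds afterwards.
issued = datetime(2024, 1, 6, 18, 0, tzinfo=timezone.utc)  # illustrative only
next_update = issued + timedelta(hours=48)

for hours_later in (0, 30, 48):
    now = issued + timedelta(hours=hours_later)
    print(hours_later, crl_freshness(next_update, now))
```

With a margin like this, a stalled renewal would show up as "expiring" well before clients start rejecting the CRL, provided the resulting alert reaches a monitored destination.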
Corrective Actions
The documentation on the configuration of the multiple-region deployment of the NGE certificates was not completely up to date. The DevOps team will review the documentation and update it as necessary.
A review of all DevOps alerts has been conducted to ensure that every alert goes to the location that is actively monitored, rather than only to the Slack channel, which also receives the alerts but is not routinely monitored.
The DevOps team has subscribed to status updates through the Thales Status Page and will review any changes or maintenance posted there so that internal testing can be performed to validate that SecureAuth operations are not impacted.
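The alert review described above could be repeated periodically with a small audit script. The sketch below assumes the alert routes can be exported as a mapping from alert name to a list of destinations. The alert names, destination labels, and function name are all hypothetical and do not reflect the actual SecureAuth alerting setup.

```python
def unmonitored_alerts(routes, monitored):
    """Return the names of alerts that reach no actively monitored destination.

    routes: dict mapping alert name -> list of destination labels.
    monitored: collection of destination labels that the L1 team watches.
    """
    monitored = set(monitored)
    return sorted(
        name for name, destinations in routes.items()
        if not monitored.intersection(destinations)
    )


# Hypothetical export of alert routing; every name here is illustrative.
routes = {
    "crl-renewal-failed": ["slack:#devops-alerts"],
    "hsm-unreachable": ["slack:#devops-alerts", "l1-queue"],
}
monitored = {"l1-queue"}

print(unmonitored_alerts(routes, monitored))  # ['crl-renewal-failed']
```

An alert that goes only to the Slack channel is reported, while one that also goes to the monitored destination is not.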