Polaris Twilight Outage RCA - September 12, 2024
Problem Description
On September 11, 2024 at 7:16PM, the SecureAuth Cloud Infrastructure encountered widespread connection issues with databases systems which resulted in authentication failures for impacted customers.
Cause
The SecureAuth Cloud Operations team was alerted of connections issues with the Twilight service (integral service which other microservices are reliant). Upon investigation, we identified that the service was experiencing database latency due to CPU utilization spikes on the database. The CPU spikes triggered mass restarts of the Twilight Service which in turn caused extended CPU spikes on the database.
The root cause was due to legacy dependencies on the database that were negatively affected during a redistribution exercise related to the Vault migration performed on August 29, 2024. Those legacy dependencies were originally determined to be benign, and therefore assumed to have no impact to the customer base after the Vault migration.
It was determined that the CPU spikes were caused by the interface between the service and the database in form of health checks that created a snowball effect, resulting in the aforementioned issues with the Twilight service.
Due to the nature of this issue, not all customers were immediately impacted; however, the recovery and resolution of this issue impacted all customer cloud services as a result of the scaling operations.
Recovery
To mitigate this issue, the cloud services were scaled down alleviate database pressure. Once the database stabilized, the services were scaled back up in a controlled manner until all services were fully restored.
Timeline:
Sep 11, 2024
Corrective Actions