Cloud Service Issue

Incident Report for SecureAuth Service

Postmortem

Polaris Twilight Outage RCA - September 12, 2024

Problem Description

On September 11, 2024 at 7:16PM, the SecureAuth Cloud Infrastructure encountered widespread connection issues with database systems, which resulted in authentication failures for impacted customers.

Cause

The SecureAuth Cloud Operations team was alerted to connection issues with the Twilight service (an integral service on which other microservices rely). Upon investigation, we identified that the service was experiencing database latency due to CPU utilization spikes on the database. The CPU spikes triggered mass restarts of the Twilight service, which in turn caused extended CPU spikes on the database.

The root cause was a set of legacy dependencies on the database that were negatively affected during a redistribution exercise related to the Vault migration performed on August 29, 2024. Those legacy dependencies had originally been assessed as benign and were therefore assumed to have no customer impact after the Vault migration.

The CPU spikes were determined to be caused by the interface between the service and the database, specifically health checks that created a snowball effect, resulting in the aforementioned issues with the Twilight service.
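As an illustration only, the following minimal sketch shows one general way a health check can avoid amplifying database load: the database probe result is cached for a short interval, and the liveness signal does not depend on the database, so a slow database does not by itself trigger restarts. This report does not describe how the Twilight health checks are implemented. The class name, the `probe` callable, and the `ttl` value are all hypothetical.

```python
import time
from typing import Callable, Optional


class CachedHealthCheck:
    """Hypothetical health check that limits how often the database is probed."""

    def __init__(
        self,
        probe: Callable[[], bool],
        ttl: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._probe = probe          # assumed: a cheap query against the database
        self._ttl = ttl              # seconds to reuse the last probe result
        self._clock = clock
        self._last_checked: Optional[float] = None
        self._last_result = False

    def ready(self) -> bool:
        """Readiness: reflects the database, but probes it at most once per ttl."""
        now = self._clock()
        if self._last_checked is None or now - self._last_checked >= self._ttl:
            try:
                self._last_result = bool(self._probe())
            except Exception:
                # A failed probe marks the instance not ready; it does not crash it.
                self._last_result = False
            self._last_checked = now
        return self._last_result

    def alive(self) -> bool:
        """Liveness: deliberately independent of the database, so database
        latency alone never causes the process to be restarted."""
        return True
```

With this shape, a burst of health check requests results in a single database query per interval rather than one query per request.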

Due to the nature of this issue, not all customers were immediately impacted; however, the recovery and resolution of this issue impacted all customer cloud services as a result of the scaling operations.

Recovery

To mitigate this issue, the cloud services were scaled down to alleviate database pressure. Once the database stabilized, the services were scaled back up in a controlled manner until all services were fully restored.
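The controlled scale-up described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the procedure the Cloud Operations team actually ran: the batch size, the CPU threshold, and the `read_db_cpu` and `set_replicas` callables are hypothetical placeholders.

```python
import time
from typing import Callable, List


def scale_up_steps(current: int, target: int, batch: int) -> List[int]:
    """Return the intermediate replica counts for a staggered scale-up."""
    if batch <= 0:
        raise ValueError("batch must be positive")
    steps = []
    while current < target:
        current = min(current + batch, target)
        steps.append(current)
    return steps


def controlled_scale_up(
    current: int,
    target: int,
    batch: int,
    read_db_cpu: Callable[[], float],      # assumed: returns database CPU in percent
    set_replicas: Callable[[int], None],   # assumed: applies a new replica count
    cpu_limit: float = 70.0,               # hypothetical threshold
    settle_seconds: float = 30.0,          # hypothetical wait between checks
    pause: Callable[[float], None] = time.sleep,
) -> int:
    """Add replicas in batches, waiting for the database to settle before each step."""
    for replicas in scale_up_steps(current, target, batch):
        # Hold the next batch while the database is still under pressure.
        while read_db_cpu() > cpu_limit:
            pause(settle_seconds)
        set_replicas(replicas)
        current = replicas
    return current
```

Gating each batch on database CPU keeps the restart of many service instances from recreating the load spike that the scale-down was meant to relieve.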

Timeline: Sep 11, 2024

7:16PM PST – Twilight connection issues begin and alerts are triggered
7:17PM PST – Cloud Operations team joins bridge to investigate alerts
7:27PM PST – Issue is understood and mitigation efforts begin
7:27PM PST – Scale down of cloud services to alleviate database pressure begins
7:40PM PST – Scale down complete and database CPU utilization stabilizes
7:41PM PST – Controlled (staggered) scale up of cloud services begins
8:30PM PST – Controlled scale up of cloud services is completed
8:40PM PST – All services in running state
9:00PM PST – Validation testing complete and incident resolved
Post-9:00PM PST – Continued to monitor closely while working with some customers as needed to resolve intermittent issues caused by the incident.

Corrective Actions

Engineering to review and improve the Twilight to Cockroach Database interface and determine a more elegant approach to health checks that would reduce mass restarts of the service during periods of high-usage spikes.
Leadership to review database alternatives for the solution architecture.
Improve decision-making accuracy by increasing team knowledge of legacy systems, to ensure end-to-end awareness of the potential impact of configuration changes assumed to be benign.
Introduce additional gates into the existing CAB (Change Advisory Board) process, including review by additional Engineering leadership and cross-functional Subject Matter Experts.

Posted Sep 12, 2024 - 13:30 PDT

Resolved

The incident has been resolved. For any remaining issues or questions please reach out to support@secureauth.com.

Posted Sep 11, 2024 - 21:38 PDT

Monitoring

A fix has been implemented and all issues have been resolved. We are continuing to monitor.

Posted Sep 11, 2024 - 21:17 PDT

Identified

We have identified the issue and are in the process of implementing a fix.

Posted Sep 11, 2024 - 20:33 PDT

Investigating

We are investigating an issue with our Cloud Services and will post updates as we gain a better understanding of the issue.

Posted Sep 11, 2024 - 19:43 PDT

This incident affected: Workforce (SMS, Voice, Push).