RCA - Cascading Service Failure - 12-15-2025
Incident date: December 15, 2025
Duration: ~90 minutes
Severity: SEV-1
Services affected: Polaris platform services
Summary
On December 15, 2025, we experienced a significant service disruption caused by a cascading failure across multiple internal systems.
A brief loss of internal network connectivity in our Vault cluster caused a loss of leadership, which triggered restart loops in dependent services. Automated scaling amplified those restarts, resulting in a surge of database connections that overloaded our CockroachDB cluster. This caused elevated latency and service unavailability across a majority of tenants.
No customer data was lost.
Fortunately, because of improvements made with the new EKS cluster and GitOps methodology, engineering was able to quickly gather details on the impacted services and build a clear understanding of what was occurring. This led to a faster recovery and to better-informed action items.
What happened
At approximately 1:25 PM PST, Vault lost its active leader following a transient internal connectivity issue. Vault itself was not under resource pressure, but several application services depended on Vault being continuously available.
When Vault became unavailable:
Twilight and other dependent services failed health checks
Those services entered restart loops
Kubernetes autoscaling rapidly increased pod counts, overloading the cluster
The resulting connection storm overloaded CockroachDB
Once CockroachDB was saturated, Vault could not recover cleanly, prolonging the incident until manual intervention reduced system load.
The combination of the overloaded Kubernetes cluster and the saturated CockroachDB blocked the new cluster's improved self-healing and auto-recovery capabilities.
Impact
During the incident window:
Requests across all tenants experienced increased latency or failure
Services depending on CockroachDB were degraded
New workloads were unable to schedule due to cluster pressure
Customer-facing services were restored once database load returned to normal and the mass service restarts subsided.
Timeline
All times PST.
1:25 PM – Vault loses leader following internal connectivity disruption
1:27 PM – Twilight health checks begin failing; pods restart repeatedly
1:28 PM – Autoscaling increases pod counts to maximum limits
1:28 PM – CockroachDB CPU and connection counts spike sharply
1:37 PM – Operations team begins investigation into alert spikes
1:50 PM – Support updates Slack with incoming support tickets
1:55 PM – Support/Operations declare an official incident; Zoom bridge opened; StatusPage updated
2:10 PM – CockroachDB scaled to stabilize database performance
2:30 PM – Vault recovers after database stabilization
2:35 PM – Affected services manually scaled down, overriding HPA, to alleviate cluster pressure
2:45 PM – Platform fully recovered, services return to typical state
Contributing factors
Vault leadership loss
Vault lost its leader early in the incident window. While brief, this event disrupted secret retrieval for dependent services. Vault was not CPU or memory constrained; the failure mode was related to leadership and internal connectivity.
Dependency failure
Twilight services required continuous Vault connectivity to operate. When Vault became unavailable, those services were unable to start successfully. This hard dependency prevented graceful degradation and contributed directly to restart loops.
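To illustrate the direction of the fix described later under "What we're changing", here is a minimal Go sketch of removing that hard dependency: secrets are fetched once at startup with bounded retries and held in memory, so a later Vault disruption does not take a running service down. The Vault path, retry counts, and the secretStore type are hypothetical and illustrative, not the actual Twilight implementation.

```go
package main

import (
	"fmt"
	"log"
	"sync"
	"time"

	vault "github.com/hashicorp/vault/api"
)

// secretStore holds secrets in memory after a single startup fetch, so the
// running service does not depend on Vault being continuously reachable.
type secretStore struct {
	mu   sync.RWMutex
	data map[string]interface{}
}

// loadOnce fetches the secret with bounded retries and exponential backoff.
// Retrying in-process avoids the tight crash/restart loop that an immediate
// exit or failed health check would otherwise trigger.
func (s *secretStore) loadOnce(path string, attempts int) error {
	client, err := vault.NewClient(vault.DefaultConfig()) // reads VAULT_ADDR / VAULT_TOKEN
	if err != nil {
		return fmt.Errorf("vault client: %w", err)
	}
	backoff := time.Second
	for i := 0; i < attempts; i++ {
		secret, err := client.Logical().Read(path)
		if err == nil && secret != nil {
			s.mu.Lock()
			s.data = secret.Data
			s.mu.Unlock()
			return nil
		}
		log.Printf("vault read failed (attempt %d/%d): %v", i+1, attempts, err)
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("could not read %s after %d attempts", path, attempts)
}

// get returns the in-memory copy; a later Vault outage does not affect it.
func (s *secretStore) get(key string) (interface{}, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.data[key]
	return v, ok
}

func main() {
	store := &secretStore{}
	// "secret/data/twilight/db" is an illustrative path, not the real one.
	if err := store.loadOnce("secret/data/twilight/db", 5); err != nil {
		log.Fatalf("startup: %v", err)
	}
	if _, ok := store.get("password"); ok {
		log.Println("credentials loaded and held in memory")
	}
}
```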
Service restart amplification
Twilight services repeatedly restarted after losing access to Vault. Autoscaling responded to these restarts by rapidly increasing replica counts. Rather than restoring availability, this behavior amplified load across the cluster.
CockroachDB saturation
The surge in restart activity and autoscaled pods caused a sustained spike in database CPU usage and connection counts. This saturation increased query latency and prevented dependent systems from recovering automatically.
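The total connection load is roughly the per-pod connection limit multiplied by the number of pods, so bounding the per-pod pool puts a known ceiling on what autoscaled restarts can do to the database. A minimal Go sketch, assuming CockroachDB is reached over the PostgreSQL wire protocol via the lib/pq driver; the DSN and the specific limits are illustrative.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

func main() {
	// Illustrative DSN; real credentials would come from the secret store.
	db, err := sql.Open("postgres", "postgresql://app@cockroachdb:26257/app?sslmode=verify-full")
	if err != nil {
		log.Fatalf("open: %v", err)
	}

	// Bound each pod's contribution to total database connections so that
	// a restart storm multiplied by autoscaled replicas cannot saturate
	// the cluster: max replicas x MaxOpenConns gives the upper bound.
	db.SetMaxOpenConns(5)
	db.SetMaxIdleConns(2)
	db.SetConnMaxLifetime(5 * time.Minute)
	db.SetConnMaxIdleTime(time.Minute)

	if err := db.Ping(); err != nil {
		log.Printf("database not ready yet: %v", err) // degrade, do not crash-loop
	}
}
```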
Root cause
This incident resulted from multiple conditions interacting:
Hard dependency on continuous Vault availability - Services failed rather than degrading when secrets could not be retrieved.
Overly permissive autoscaling limits - Default HPA maximums allowed rapid scaling beyond operational needs.
Shared database cluster infrastructure overload - The resulting connection storm overwhelmed CockroachDB, blocking recovery.
Individually, none of these issues would have caused a platform-wide outage. Together, they resulted in a cascading failure.
What we’re changing
We are making the following improvements:
Application architecture
Remove hard dependencies on continuous Vault availability (GitHub PR created, in review)
Ensure secrets are retrieved and held securely at startup (GitHub PR created, in review)
Run expensive startup operations once per major maintenance instead of on every pod restart, as sketched below (GitHub PR created, in review)
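A minimal sketch of the last item above, assuming a hypothetical marker file on a persistent volume that records the maintenance version for which the expensive bootstrap work (migrations, cache warmup, and similar) last completed; the path and version string are placeholders.

```go
package main

import (
	"log"
	"os"
	"strings"
)

// markerPath is a hypothetical file on a persistent volume recording the
// maintenance version for which expensive bootstrap work last completed.
const markerPath = "/data/bootstrap-version"

// currentVersion would normally come from a build tag or environment variable.
const currentVersion = "2025.12"

func needsBootstrap() bool {
	buf, err := os.ReadFile(markerPath)
	if err != nil {
		return true // no marker yet: first run after maintenance
	}
	return strings.TrimSpace(string(buf)) != currentVersion
}

func main() {
	if needsBootstrap() {
		log.Println("running one-time bootstrap for", currentVersion)
		// ... expensive work (migrations, cache warmup) would run here ...
		if err := os.WriteFile(markerPath, []byte(currentVersion+"\n"), 0o644); err != nil {
			log.Printf("could not record bootstrap marker: %v", err)
		}
	} else {
		log.Println("bootstrap already completed for", currentVersion, "- skipping")
	}
	// normal service startup continues either way
}
```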
Infrastructure configuration
Reduce autoscaling maximums to realistic service-specific limits, as sketched after this list (to be updated 12/16 overnight)
Isolate Vault onto a dedicated database backend (In Progress)
Review Vault usage and reduce the dependency by moving to an alternative solution (Kubernetes secrets encrypted with SOPS)
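To support the autoscaling item above, a read-only Go sketch using client-go to flag HPAs whose maxReplicas exceeds a chosen ceiling. The ceiling and kubeconfig handling are illustrative, and the actual limits will be set per service through the GitOps manifests rather than by a script.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// maxReasonableReplicas is an illustrative ceiling; real limits are set
// per service in the GitOps manifests.
const maxReasonableReplicas = 10

func main() {
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatalf("kubeconfig: %v", err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("client: %v", err)
	}

	// List HPAs in all namespaces and flag any whose maximum would allow
	// the kind of runaway scale-out seen during the incident.
	hpas, err := clientset.AutoscalingV2().HorizontalPodAutoscalers("").List(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatalf("list hpas: %v", err)
	}
	for _, hpa := range hpas.Items {
		if hpa.Spec.MaxReplicas > maxReasonableReplicas {
			fmt.Printf("%s/%s: maxReplicas=%d exceeds ceiling of %d\n",
				hpa.Namespace, hpa.Name, hpa.Spec.MaxReplicas, maxReasonableReplicas)
		}
	}
}
```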
Resilience and monitoring
Update health checks so non-critical failures do not trigger restarts, as sketched after this list (PR created, in review)
Add safeguards to prevent database overload during restart storms
Expand failure-mode testing in lower environments
Multi-Region Hot Standby (In Progress, Available Q2 2026)
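A minimal sketch of the health-check change referenced above, assuming the services expose separate HTTP liveness and readiness endpoints: liveness reports only whether the process itself is healthy, so an unreachable Vault or database marks the pod not-ready (removing it from traffic) rather than triggering a restart. Endpoint paths and the readiness flag are illustrative.

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// dependenciesReady would be flipped by background checks of Vault and the
// database (omitted in this sketch); it gates readiness only, never liveness.
var dependenciesReady atomic.Bool

func main() {
	// Liveness: the process is up and able to serve this handler.
	// Losing Vault or the database is non-critical here, so it does not
	// cause the kubelet to restart the pod.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: dependencies are reachable; failing this only removes the
	// pod from Service endpoints until the dependency recovers.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if dependenciesReady.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		http.Error(w, "dependencies unavailable", http.StatusServiceUnavailable)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```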
Closing
A brief disruption in a single component should not cascade into a platform-wide incident. This event highlighted areas where coupling and automated responses amplified failure instead of containing it.
While we have made substantial improvements to our overall infrastructure over the last few months (adopting GitOps, creating a new cluster with improved monitoring and core services, increasing metric and log gathering and retention, and eliminating manual changes), we are addressing the gaps this incident exposed and strengthening the platform to improve resilience going forward.
We will be scheduling a maintenance window within the next two weeks to implement the immediate remediation items, while the hot standby will bring further improvements and failover within minutes once available. These immediate changes, informed by the deeper root-cause analysis made possible by the improved cluster telemetry, coupled with the multi-region hot standby going live in Q2, will address the underlying causes of the recent incidents.