RCA - Cascading Service Failure - 12-15-2025
Incident date: December 15, 2025
Duration: ~90 minutes
Severity: SEV-1
Services affected: Polaris platform services
Summary
On December 15, 2025, we experienced a significant service disruption caused by a cascading failure across multiple internal systems.
A brief loss of internal network connectivity in our Vault cluster caused a loss of leadership, which triggered restart loops in dependent services. Automated scaling amplified those restarts, resulting in a surge of database connections that overloaded our CockroachDB cluster. This caused elevated latency and service unavailability across a majority of tenants.
No customer data was lost.
Fortunately, because of improvements made with the new EKS cluster and GitOps methodology, engineering was able to quickly gather details on the impacted services and build a clear understanding of what was occurring. This led to a faster recovery and to better-informed action items.
What happened
At approximately 1:25 PM PST, Vault lost its active leader following a transient internal connectivity issue. Vault itself was not under resource pressure, but several application services depended on Vault being continuously available.
When Vault became unavailable:
Twilight and other dependent services failed health checks
Those services entered restart loops
Kubernetes autoscaling rapidly increased pod counts, overloading the cluster
The resulting connection storm overloaded CockroachDB
Once CockroachDB was saturated, Vault could not recover cleanly, prolonging the incident until manual intervention reduced system load.
The combination of the overloaded Kubernetes cluster and the saturated CockroachDB blocked the new cluster's improved self-healing and auto-recovery capabilities.
Impact
During the incident window:
Requests across all tenants experienced increased latency or failure
Services depending on CockroachDB were degraded
New workloads were unable to schedule due to cluster pressure
Customer-facing services were restored once database load returned to normal and the mass service restarts subsided.
Timeline
All times PST.
1:25 PM – Vault loses leader following internal connectivity disruption
1:27 PM – Twilight health checks begin failing; pods restart repeatedly
1:28 PM – Autoscaling increases pod counts to maximum limits
1:28 PM – CockroachDB CPU and connection counts spike sharply
1:37 PM – Operations team begins investigation into alert spikes
1:50 PM – Support updates Slack with incoming support tickets
1:55 PM – Support/Operations declare an official incident; Zoom bridge opened; StatusPage updated
2:10 PM – CockroachDB scaled to stabilize database performance
2:30 PM – Vault recovers after database stabilization
2:35 PM – Affected services manually scaled down, overriding HPA, to alleviate cluster pressure
2:45 PM – Platform fully recovered, services return to typical state
Contributing factors
Vault leadership loss
Vault lost its leader early in the incident window. While brief, this event disrupted secret retrieval for dependent services. Vault was not CPU or memory constrained; the failure mode was related to leadership and internal connectivity.
Dependency failure
Twilight services required continuous Vault connectivity to operate. When Vault became unavailable, those services were unable to start successfully. This hard dependency prevented graceful degradation and contributed directly to restart loops.
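To illustrate the direction of the fix described later under "What we're changing", here is a minimal Go sketch of removing that hard dependency: secrets are fetched once at startup with bounded retries and held in memory, so a later Vault disruption does not take a running service down. The Vault path, retry counts, and the secretStore type are hypothetical and illustrative, not the actual Twilight implementation.

```go
package main

import (
	"fmt"
	"log"
	"sync"
	"time"

	vault "github.com/hashicorp/vault/api"
)

// secretStore holds secrets in memory after a single startup fetch, so the
// running service does not depend on Vault being continuously reachable.
type secretStore struct {
	mu   sync.RWMutex
	data map[string]interface{}
}

// loadOnce fetches the secret with bounded retries and exponential backoff.
// Retrying in-process avoids the tight crash/restart loop that an immediate
// exit or failed health check would otherwise trigger.
func (s *secretStore) loadOnce(path string, attempts int) error {
	client, err := vault.NewClient(vault.DefaultConfig()) // reads VAULT_ADDR / VAULT_TOKEN
	if err != nil {
		return fmt.Errorf("vault client: %w", err)
	}
	backoff := time.Second
	for i := 0; i < attempts; i++ {
		secret, err := client.Logical().Read(path)
		if err == nil && secret != nil {
			s.mu.Lock()
			s.data = secret.Data
			s.mu.Unlock()
			return nil
		}
		log.Printf("vault read failed (attempt %d/%d): %v", i+1, attempts, err)
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("could not read %s after %d attempts", path, attempts)
}

// get returns the in-memory copy; a later Vault outage does not affect it.
func (s *secretStore) get(key string) (interface{}, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.data[key]
	return v, ok
}

func main() {
	store := &secretStore{}
	// "secret/data/twilight/db" is an illustrative path, not the real one.
	if err := store.loadOnce("secret/data/twilight/db", 5); err != nil {
		log.Fatalf("startup: %v", err)
	}
	if _, ok := store.get("password"); ok {
		log.Println("credentials loaded and held in memory")
	}
}
```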
Service restart amplification
Twilight services repeatedly restarted after losing access to Vault. Autoscaling responded to these restarts by rapidly increasing replica counts. Rather than restoring availability, this behavior amplified load across the cluster.
CockroachDB saturation
The surge in restart activity and autoscaled pods caused a sustained spike in database CPU usage and connection counts. This saturation increased query latency and prevented dependent systems from recovering automatically.
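The total connection load is roughly the per-pod connection limit multiplied by the number of pods, so bounding the per-pod pool puts a known ceiling on what autoscaled restarts can do to the database. A minimal Go sketch, assuming CockroachDB is reached over the PostgreSQL wire protocol via the lib/pq driver; the DSN and the specific limits are illustrative.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

func main() {
	// Illustrative DSN; real credentials would come from the secret store.
	db, err := sql.Open("postgres", "postgresql://app@cockroachdb:26257/app?sslmode=verify-full")
	if err != nil {
		log.Fatalf("open: %v", err)
	}

	// Bound each pod's contribution to total database connections so that
	// a restart storm multiplied by autoscaled replicas cannot saturate
	// the cluster: max replicas x MaxOpenConns gives the upper bound.
	db.SetMaxOpenConns(5)
	db.SetMaxIdleConns(2)
	db.SetConnMaxLifetime(5 * time.Minute)
	db.SetConnMaxIdleTime(time.Minute)

	if err := db.Ping(); err != nil {
		log.Printf("database not ready yet: %v", err) // degrade, do not crash-loop
	}
}
```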
Root cause
This incident resulted from multiple conditions interacting:
Hard dependency on continuous Vault availability - Services failed rather than degrading when secrets could not be retrieved.
Overly permissive autoscaling limits - Default HPA maximums allowed rapid scaling beyond operational needs.
Shared database cluster infrastructure overload - The resulting connection storm overwhelmed CockroachDB, blocking recovery.
Individually, none of these issues would have caused a platform-wide outage. Together, they resulted in a cascading failure.
What we’re changing
We are making the following improvements:
Application architecture
Remove hard dependencies on continuous Vault availability (GitHub PR created, in review)
Ensure secrets are retrieved and held securely at startup (GitHub PR created, in review)
Run expensive startup operations once per major maintenance instead of on every pod restart, as sketched below (GitHub PR created, in review)
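A minimal sketch of the last item above, assuming a hypothetical marker file on a persistent volume that records the maintenance version for which the expensive bootstrap work (migrations, cache warmup, and similar) last completed; the path and version string are placeholders.

```go
package main

import (
	"log"
	"os"
	"strings"
)

// markerPath is a hypothetical file on a persistent volume recording the
// maintenance version for which expensive bootstrap work last completed.
const markerPath = "/data/bootstrap-version"

// currentVersion would normally come from a build tag or environment variable.
const currentVersion = "2025.12"

func needsBootstrap() bool {
	buf, err := os.ReadFile(markerPath)
	if err != nil {
		return true // no marker yet: first run after maintenance
	}
	return strings.TrimSpace(string(buf)) != currentVersion
}

func main() {
	if needsBootstrap() {
		log.Println("running one-time bootstrap for", currentVersion)
		// ... expensive work (migrations, cache warmup) would run here ...
		if err := os.WriteFile(markerPath, []byte(currentVersion+"\n"), 0o644); err != nil {
			log.Printf("could not record bootstrap marker: %v", err)
		}
	} else {
		log.Println("bootstrap already completed for", currentVersion, "- skipping")
	}
	// normal service startup continues either way
}
```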
Infrastructure configuration
Reduce autoscaling maximums to realistic service-specific limits, as sketched after this list (to be updated 12/16 overnight)
Isolate Vault onto a dedicated database backend (In Progress)
Review Vault usage and reduce the dependency by moving to an alternative solution (Kubernetes secrets encrypted with SOPS)
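To support the autoscaling item above, a read-only Go sketch using client-go to flag HPAs whose maxReplicas exceeds a chosen ceiling. The ceiling and kubeconfig handling are illustrative, and the actual limits will be set per service through the GitOps manifests rather than by a script.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// maxReasonableReplicas is an illustrative ceiling; real limits are set
// per service in the GitOps manifests.
const maxReasonableReplicas = 10

func main() {
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatalf("kubeconfig: %v", err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("client: %v", err)
	}

	// List HPAs in all namespaces and flag any whose maximum would allow
	// the kind of runaway scale-out seen during the incident.
	hpas, err := clientset.AutoscalingV2().HorizontalPodAutoscalers("").List(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatalf("list hpas: %v", err)
	}
	for _, hpa := range hpas.Items {
		if hpa.Spec.MaxReplicas > maxReasonableReplicas {
			fmt.Printf("%s/%s: maxReplicas=%d exceeds ceiling of %d\n",
				hpa.Namespace, hpa.Name, hpa.Spec.MaxReplicas, maxReasonableReplicas)
		}
	}
}
```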
Resilience and monitoring
Update health checks so non-critical failures do not trigger restarts, as sketched after this list (PR created, in review)
Add safeguards to prevent database overload during restart storms
Expand failure-mode testing in lower environments
Multi-Region Hot Standby (In Progress, Available Q2 2026)
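A minimal sketch of the health-check change referenced above, assuming the services expose separate HTTP liveness and readiness endpoints: liveness reports only whether the process itself is healthy, so an unreachable Vault or database marks the pod not-ready (removing it from traffic) rather than triggering a restart. Endpoint paths and the readiness flag are illustrative.

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// dependenciesReady would be flipped by background checks of Vault and the
// database (omitted in this sketch); it gates readiness only, never liveness.
var dependenciesReady atomic.Bool

func main() {
	// Liveness: the process is up and able to serve this handler.
	// Losing Vault or the database is non-critical here, so it does not
	// cause the kubelet to restart the pod.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: dependencies are reachable; failing this only removes the
	// pod from Service endpoints until the dependency recovers.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if dependenciesReady.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		http.Error(w, "dependencies unavailable", http.StatusServiceUnavailable)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```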
Closing
A brief disruption in a single component should not cascade into a platform-wide incident. This event highlighted areas where coupling and automated responses amplified failure instead of containing it.
While we have made substantial improvements to our overall infrastructure over the last few months (adopting GitOps, creating a new cluster with improved monitoring and core services, increasing metric and log gathering and retention, and eliminating manual changes), we are addressing the gaps this incident exposed and strengthening the platform to improve resilience going forward.
We will be scheduling a maintenance window within the next two weeks to implement the immediate remediation items, while the hot standby will bring further improvements and failover within minutes once available. These immediate changes, informed by the deeper root-cause analysis made possible by the improved cluster telemetry, coupled with the multi-region hot standby going live in Q2, will address the underlying causes of the recent incidents.