RCA – Cloud Services Outage/RDS Failure - October 23, 2025
Problem Description: On October 23, 2025, at 12:30 PM PDT, the IP risk evaluation service in SecureAuth’s Cloud infrastructure began to degrade. By around 2:00 PM PDT, the degradation had escalated into widespread database connection issues that affected multiple cloud services and caused authentication failures for impacted customers. A gradual rolling restart of all single-tenant services, performed in batches, restored service, with full recovery at 4:00 PM PDT.
Cause: The incident originated with an unbounded increase in traffic from the IP risk evaluation service (ipintelsvc) to the production database. The resulting proliferation of concurrent, heavy-query sessions overwhelmed the shared Aurora RDS cluster, including the segment serving the Vault secret-storage service. When Vault lost database connectivity, it crashed, which caused downstream failures in dependent services.
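For illustration only, the sketch below shows one general way a service can bound its number of concurrent database sessions so that a traffic surge is rejected early rather than passed through to a shared database. It is not the ipintelsvc implementation or the fix that was applied; the class name SessionLimiter and its parameters are hypothetical.

```python
import threading
from contextlib import contextmanager


class SessionLimiter:
    """Caps how many database sessions may be open at once (hypothetical helper)."""

    def __init__(self, max_sessions: int, wait_seconds: float):
        # BoundedSemaphore holds one slot per allowed concurrent session.
        self._slots = threading.BoundedSemaphore(max_sessions)
        self._wait_seconds = wait_seconds

    @contextmanager
    def session(self):
        # Wait briefly for a free slot, then fail fast instead of
        # adding yet another session to an already busy database.
        if not self._slots.acquire(timeout=self._wait_seconds):
            raise TimeoutError("session limit reached; request rejected")
        try:
            yield
        finally:
            # Always return the slot, even if the query raised.
            self._slots.release()
```

A caller would wrap each database query in "with limiter.session():" and treat TimeoutError as a signal to shed or retry the request later.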
Recovery: The DevOps and Engineering teams began resolution efforts by scaling down the IP risk evaluation service (ipintelsvc) traffic. RDS CPU and connection metrics then gradually decreased, allowing auxiliary Vault services to come back up. A rolling restart of all customer service deployments was then performed in gradual batches.
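As a rough illustration of the batched approach described above, the following sketch restarts deployments a few at a time and stops if a batch does not come back healthy. It is not the tooling used during the incident; the function names, the restart and is_healthy callbacks, and the pause between batches are all assumptions.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Split items into consecutive groups of at most `size` elements."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def rolling_restart(
    deployments: Sequence[str],
    batch_size: int,
    restart: Callable[[str], None],      # hypothetical: restarts one deployment
    is_healthy: Callable[[str], bool],   # hypothetical: health check after restart
    pause_seconds: float = 0.0,
) -> List[str]:
    """Restart deployments batch by batch, halting on the first unhealthy batch."""
    restarted: List[str] = []
    for batch in batches(deployments, batch_size):
        for name in batch:
            restart(name)
        unhealthy = [name for name in batch if not is_healthy(name)]
        if unhealthy:
            raise RuntimeError(f"unhealthy after restart: {unhealthy}")
        restarted.extend(batch)
        # Pause so shared dependencies are not hit by every restart at once.
        time.sleep(pause_seconds)
    return restarted
```

Keeping the batch size small limits how many services reconnect to shared dependencies at the same moment, at the cost of a longer overall recovery window.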
Timeline: October 23, 2025
Corrective Actions:
Further enhancements will be made to separate single-tenant services and improve their resilience, decoupling them from the shared services RDS and eliminating the risk of requiring a cascading restart. This infrastructure has already been implemented and applied in the Dev and Test SecureAuth Cloud environments, and we are starting to roll it out in Production.
If you would like your tenant placed on a priority list for these updates, please reach out to your CSM or Support, and we will work on prioritizing a change window for your organization.