RCA – EKS Outage - 09022023
Leadership Response:
We apologize for the inconvenience and for the difficulty your teams faced as they gave up time with friends and family to communicate about and resolve this incident with your internal users and customers. We do not take this event lightly, and we had all hands on deck to resolve the issue as quickly and efficiently as possible.
Your experience with SecureAuth is very important to us. We value our partnership and the trust you continually place in our solution to protect your teams and your customers. We will continue to strive for excellence and make any changes necessary to deliver the stability and security that you require to be successful.
Incident Summary:
During a planned maintenance window between 06:00 and 12:00 UTC on September 2, 2023, a majority of SA IdP tenants on the EKS cluster failed, resulting in an outage. All customers, both cloud deployments and hybrid deployments that use cloud services, were down or degraded during the incident.
During a routine update of backend services, a networking component plugin failed to update, which caused the Vault service, which stores application keys, to fail. Because most production pods rely on the Vault service to obtain secrets, none of the production pods could come online. To resolve this, we reinstalled the networking component successfully, which allowed Vault to communicate with the rest of the system.
Once this issue was resolved, there was an influx of backlogged communication to the backend database as all the production pods came back online. This overloaded the database connection pool, causing additional bandwidth issues that impacted response times. To accelerate recovery of the entire environment, we temporarily reduced the number of active pods, which allowed the system to process the backlog.
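The connection pool pressure described above can be illustrated with simple arithmetic. The following is a minimal sketch only, not a description of our tooling: the function names, the pool limit, the per-pod connection count, the headroom percentage, and the ramp step are all hypothetical values chosen for illustration. It estimates how many pods can be active before their combined connections exceed a pool limit, and lists the pod counts a staged return to full capacity might step through.

```python
def max_pods_within_pool(pool_limit: int, conns_per_pod: int, headroom_pct: int = 20) -> int:
    """Largest pod count whose combined connections fit in the pool,
    after reserving headroom_pct percent of the pool as spare capacity.
    All inputs are hypothetical; they are not taken from the incident."""
    if conns_per_pod <= 0:
        raise ValueError("conns_per_pod must be positive")
    usable = pool_limit * (100 - headroom_pct) // 100
    return usable // conns_per_pod


def ramp_schedule(target_pods: int, safe_pods: int, step: int) -> list[int]:
    """Pod counts to step through, from the safe count up to the target."""
    if step <= 0:
        raise ValueError("step must be positive")
    counts = []
    current = min(safe_pods, target_pods)
    while current < target_pods:
        counts.append(current)
        current += step
    counts.append(target_pods)
    return counts


if __name__ == "__main__":
    # Hypothetical numbers: a 500-connection pool and 10 connections per pod.
    safe = max_pods_within_pool(pool_limit=500, conns_per_pod=10)
    print("pods that fit within the pool:", safe)
    print("staged ramp:", ramp_schedule(target_pods=100, safe_pods=safe, step=20))
```

With these assumed numbers, starting every pod at once would request more connections than the pool can serve, whereas running a reduced number of pods keeps demand within the pool while the backlog drains.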
At approximately 15:50 UTC, all services were restored, and a postmortem of the incident began.
Root Cause:
Resolution:
Corrective Actions:
Completing any outstanding EKS upgrade tasks
Implementing a new communication protocol to inform customers of future incidents in a more timely and comprehensive manner