Issues with Polaris services

Incident Report for SecureAuth Service

Postmortem

RCA – EKS Outage - 09022023

Leadership Response:

We apologize for the inconvenience and the difficulty your teams faced as you gave up time with friends and family to communicate and resolve this incident with your internal users and customers. We do not take this event lightly, and we had all hands on deck to resolve the issue as quickly and efficiently as possible.

Your experience with SecureAuth is very important to us. We value our partnership and the trust you continually put into our solution to protect your teams and your customers. We will continue to strive for excellence and make any changes necessary to deliver the stability and security that you require to be successful.

Incident Summary:

During a planned maintenance window between 06:00 and 12:00 UTC on September 2, 2023, a majority of SA IdP tenants on the EKS cluster failed, resulting in an outage. All customers on cloud deployments, and on hybrid deployments that use cloud services, were down or degraded during the incident.

While a routine update of backend services was being performed, a networking component plugin failed its update, which caused the Vault service, which stores application keys, to fail. Because most production pods rely on the Vault service to obtain secrets, none of the production pods could come online. To resolve this, we reinstalled the networking component successfully, which allowed Vault to communicate with the rest of the system.

Once this issue was resolved, there was an influx of backlogged communication to the backend database as all the production pods came back online. This overloaded the database connection pool, causing additional bandwidth issues that impacted response times. To accelerate recovery of the entire environment, we temporarily reduced the number of active pods, which allowed the system to process the backlog.
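To illustrate why reducing the pod count relieves an overloaded connection pool, the following is a minimal sketch of the underlying arithmetic. It is not the tooling used during the incident. The function names, the connection limits, the headroom percentage, and the batch size are all hypothetical values chosen for illustration.

```python
from typing import List


def max_safe_pods(db_max_connections: int, conns_per_pod: int,
                  headroom_percent: int = 80) -> int:
    """Largest pod count whose combined pools fit inside the connection budget.

    All parameters are illustrative assumptions, not values from the incident.
    """
    if conns_per_pod <= 0:
        raise ValueError("conns_per_pod must be positive")
    if not 0 < headroom_percent <= 100:
        raise ValueError("headroom_percent must be in (0, 100]")
    # Integer arithmetic avoids floating-point rounding surprises.
    return (db_max_connections * headroom_percent) // (100 * conns_per_pod)


def ramp_schedule(target_pods: int, safe_pods: int, batch_size: int) -> List[int]:
    """Replica counts to step through, never exceeding the safe ceiling."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    ceiling = min(target_pods, safe_pods)
    steps = list(range(batch_size, ceiling, batch_size))
    steps.append(ceiling)  # always finish exactly at the ceiling
    return steps


if __name__ == "__main__":
    safe = max_safe_pods(db_max_connections=1000, conns_per_pod=10)
    print(safe)                          # 80
    print(ramp_schedule(300, safe, 25))  # [25, 50, 75, 80]
```

In this sketch, a database that accepts 1,000 connections, with 10 connections per pod and 20% headroom reserved, can serve at most 80 pods at once, so a recovery would hold the replica count at or below that ceiling until the backlog drains.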

At approximately 15:50 UTC, all services were restored, and a postmortem of the incident began.

Root Cause:

HashiCorp Vault failed to start after the EKS cluster update. Recovery required manual intervention on the VPC-CNI and CoreDNS add-ons. Many other cloud services depend on Vault in order to start.
Once Vault was operational, thousands of pods attempted to come online at once, and many of them needed to connect to one or more databases. The database servers became overwhelmed, preventing all services from coming back online.
A bug was discovered in the AMI used for production EKS worker nodes that causes auto-scaling of deployment replicas to grow to maximum capacity. The bug over-reports how much CPU each pod is using. This, in turn, generated about three times the normal pod count, further complicating recovery.
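To show how over-reported CPU can inflate the pod count, here is a minimal sketch based on the scaling formula documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). We assume a CPU-utilization target for illustration. The replica counts, utilization figures, and limits below are hypothetical, and the sketch omits the real autoscaler's tolerance band and stabilization window.

```python
import math


def desired_replicas(current_replicas: int, reported_cpu_pct: float,
                     target_cpu_pct: float, min_replicas: int,
                     max_replicas: int) -> int:
    """Simplified HPA calculation, clamped to the configured replica bounds."""
    if target_cpu_pct <= 0:
        raise ValueError("target_cpu_pct must be positive")
    raw = math.ceil(current_replicas * reported_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Accurate reporting: utilization equals the target, so no scaling occurs.
    print(desired_replicas(10, 50, 50, min_replicas=2, max_replicas=40))   # 10
    # Same real load, but usage is over-reported threefold (illustrative).
    print(desired_replicas(10, 150, 50, min_replicas=2, max_replicas=40))  # 30
    # With a lower ceiling, the deployment simply pins at its maximum.
    print(desired_replicas(10, 150, 50, min_replicas=2, max_replicas=20))  # 20
```

Because the formula scales linearly with the reported metric, a threefold over-report requests roughly three times the replicas, until the deployment reaches its configured maximum.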

Resolution:

Updated the VPC-CNI
Restarted CoreDNS and Vault
Throttled replication events to prevent overload on the databases
Deployed a workaround to the auto-scale bug to prevent overruns on connections

Corrective Actions:

Instrumenting the upgrade process to detect CNI and Vault failures
Deploying future upgrades in isolated pod clusters to reduce impact to customers
Deploying a full resolution for the auto-scale bug
Completing any outstanding EKS upgrade tasks
Implementing a new communication protocol to inform customers of future incidents in a more timely and comprehensive manner
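As one possible shape for the upgrade instrumentation listed above, the sketch below checks dependencies in start-up order and reports the first one that is unhealthy, so an upgrade can be halted before dependent pods are restarted. It is an illustration under stated assumptions, not our actual tooling: the function name and the probe callables are hypothetical, and real probes would query the cluster and Vault rather than return fixed values.

```python
from typing import Callable, Optional, Sequence, Tuple

# A probe returns True when its component is healthy (hypothetical interface).
Probe = Callable[[], bool]


def first_failed_dependency(probes: Sequence[Tuple[str, Probe]]) -> Optional[str]:
    """Run probes in order; return the first unhealthy component, or None."""
    for name, probe in probes:
        try:
            healthy = probe()
        except Exception:
            # A probe that cannot run is treated as a failed dependency.
            healthy = False
        if not healthy:
            return name
    return None


if __name__ == "__main__":
    # Placeholder probes ordered by dependency: network, then DNS, then Vault.
    checks = [
        ("vpc-cni", lambda: True),
        ("coredns", lambda: True),
        ("vault", lambda: False),
    ]
    failed = first_failed_dependency(checks)
    print(failed or "all dependencies healthy")  # vault
```

Ordering the probes by dependency means a networking failure is reported as such, rather than surfacing later as a Vault or application error.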

Posted Sep 12, 2023 - 06:32 PDT

Resolved

We will be continuing to watch the platform; however, we have had no further indications of system issues. Please contact support@secureauth.com if you have any issues.

A full Root Cause Analysis (RCA) will be forthcoming.

Posted Sep 03, 2023 - 09:05 PDT

Update

The outage you’ve been experiencing has been resolved, and service has been restored for all affected parties. We are monitoring the status closely. We appreciate your patience with our teams as we worked diligently to bring you back online.

A formal root cause analysis (RCA) will be forthcoming and will be provided through our Customer Experience team. In the meantime, we want to share as much information as possible while we work on the official RCA.

During a routine update of backend services, a networking component plugin failed its update, which caused the Vault service (storage of application keys) to fail. We reinstalled the networking component successfully, which allowed Vault to communicate with the rest of the system.

This issue caused an influx of communication to the backend database that overloaded the system. It also caused an excessive number of pods to spin up, creating additional bandwidth issues that further aggravated the response times and connection issues.

To resolve, we manually reduced the number of active pods, which allowed the system to slowly recover from the outage.

Posted Sep 02, 2023 - 09:53 PDT

Monitoring

The infrastructure has been recovered and loads are returning to normal. We will continue to monitor.

Posted Sep 02, 2023 - 09:15 PDT

Update

Some services have been restored at this time, but we are continuing to troubleshoot the issue.

Posted Sep 02, 2023 - 08:25 PDT

Update

We have identified the issue and are working to restore service. We will update when we have more information.

Posted Sep 02, 2023 - 07:56 PDT

Update

We have discovered an infrastructure issue, and AWS support is involved in helping to resolve it.

Posted Sep 02, 2023 - 06:53 PDT

Update

We believe we have a focal point for the investigation and will post an update as soon as we have more info.

Posted Sep 02, 2023 - 05:34 PDT

Update

We are continuing to work on a fix for this issue.

Posted Sep 02, 2023 - 05:01 PDT

Identified

We are currently investigating intermittent issues with Polaris services. Current services that may be impacted are full cloud and hybrid SA IdP installations. We will provide updates as we receive more information. While our maintenance window has not yet finished (maintenance window ends 8 AM Eastern Time), we are notifying customers of a potential issue. We will keep you updated as we progress.

Posted Sep 02, 2023 - 04:50 PDT