Push Service Failures

Incident Report for SecureAuth Service

Postmortem

SecureAuth Production Incident Root Cause Analysis (RCA)

Incident Description

The Push to Accept (P2A) service was sending push requests to users; however, approvals of those requests were not reaching the Identity Platform (IdP), so logins for users with P2A multi-factor authentication (MFA) could not complete. All other MFA methods remained available to users.

Root Cause

During routine automated scaling of the Amazon Elastic Kubernetes Service (EKS) cluster, the networking components of the EKS nodes experienced an issue that caused the Redis databases used within the environment to lose connectivity to other services. It is suspected that, because a single auto-scaling policy covers the entire cluster, the scaling event impacted the cluster's core networking and DNS services.

Resolution

Alerts were generated and investigated by the DevOps Team within minutes of the networking issues. Because the alerts came from multiple services, several other services were cycled before the Redis services were identified as the source of the failure.
The Redis pods were then cycled, and the Push Service was restored to normal operation.

Corrective Actions

The DevOps Team is working to move from Redis pods on EKS to the AWS ElastiCache service, which will provide redundancy and additional stability for the service. ETA is the end of 2022.
Additional alerts have been created, and updated procedures have been added to runbooks to reduce the risk of a future failure.
The DevOps Team is modifying the auto-scaling policies to place the core services in one auto-scaling group (ASG) and the application itself in a separate ASG. This is expected to be completed by the end of October 2022.
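
For illustration only, the kind of check that the additional alerts might rely on can be sketched as a small probe that distinguishes a DNS failure from a Redis connectivity failure. This is a minimal sketch under stated assumptions, not the tooling actually used: the function name probe_redis, the result labels, and the example hostname are hypothetical, and the probe assumes a Redis endpoint that accepts an unauthenticated inline PING.

```python
import socket


def probe_redis(host, port=6379, timeout=2.0):
    """Classify reachability of a Redis endpoint (hypothetical helper).

    Returns one of: 'ok', 'dns_failure', 'connect_failure', 'bad_reply'.
    """
    # Step 1: resolve the name. A failure here points at cluster DNS,
    # not at Redis itself.
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"

    family, socktype, proto, _, sockaddr = infos[0]

    # Step 2: open a TCP connection and send an inline PING command.
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            sock.sendall(b"PING\r\n")
            reply = sock.recv(64)
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return "connect_failure"

    # Step 3: a healthy, unauthenticated Redis answers with +PONG.
    # An instance that requires AUTH replies with an error instead,
    # which this sketch reports as 'bad_reply'.
    return "ok" if reply.startswith(b"+PONG") else "bad_reply"


if __name__ == "__main__":
    # Hypothetical service name; replace with the real Redis endpoint.
    print(probe_redis("redis.example.internal"))
```

Reporting DNS and connection failures separately is a design choice aimed at the situation described under Resolution, where alerts from many services made it hard to tell which component had actually failed.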

Posted Oct 06, 2022 - 19:14 PDT

Resolved

This incident has been resolved.

Posted Oct 06, 2022 - 12:35 PDT

Monitoring

The issue has been resolved; we are now monitoring.

Posted Oct 06, 2022 - 09:48 PDT

Update

Other MFA methods (TOTP, SMS, etc.) are still working.

Posted Oct 06, 2022 - 09:27 PDT

Identified

The issue has been identified and a fix is being implemented.

Posted Oct 06, 2022 - 09:00 PDT

Investigating

The push service is not accepting approvals for Push to Accept requests.

Posted Oct 06, 2022 - 08:52 PDT