The Push to Accept (P2A) service was sending push requests to users; however, approvals of those requests were not reaching the Identity Platform (IdP), so logins using P2A multi-factor authentication (MFA) could not complete. All other MFA methods remained available to users.
During routine auto-scaling of the Amazon Elastic Kubernetes Service (EKS) cluster, an issue with the networking components of the EKS nodes caused the Redis databases used within the environment to lose connectivity to other services. It is suspected that, because a single auto-scaling policy covers the entire cluster, the scaling event also affected the cluster's core networking and DNS services.
Alerts were generated and investigated by the DevOps Team within minutes of the networking issues; however, because the alerts came from multiple services, the Redis services were identified as the cause only after several other services had been cycled.
The Redis Pods were cycled, and the Push Service was restored to normal operation.
The DevOps Team is working to move from Redis pods on EKS to the Amazon ElastiCache service to provide redundancy and additional stability for the service. ETA is the end of 2022.
Additional alerts have been created, and updated procedures have been added to runbooks to reduce the risk of a similar failure in the future.
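As an illustration only, the kind of check a Redis connectivity alert could be built on is sketched below. This is not the DevOps Team's actual alert: the function name `redis_is_reachable`, the default port, and the timeout are assumptions made for the example, and it assumes a Redis endpoint that accepts PING without authentication.

```python
import socket


def redis_is_reachable(host: str, port: int = 6379, timeout: float = 2.0) -> bool:
    """Return True if the server at host:port answers a Redis PING with +PONG.

    Hypothetical helper for illustration; host, port, and timeout are assumptions.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            # Inline command form of the Redis protocol.
            conn.sendall(b"PING\r\n")
            reply = conn.recv(64)
    except OSError:
        # Covers DNS resolution failures, refused connections, and timeouts,
        # which are the failure modes a networking or DNS fault would produce.
        return False
    # A server that requires authentication replies with an error instead,
    # so it is reported as unreachable by this simple check.
    return reply.startswith(b"+PONG")
```

Running a probe like this from the services that depend on Redis, rather than from the Redis pods themselves, would show loss of connectivity between them directly instead of through downstream symptoms.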
The DevOps Team is modifying the auto-scaling policies to place the core services in one auto-scaling group (ASG) and the application itself in a separate ASG. This is expected to be completed by the end of October 2022.