Performance Issues on US1 Cloud Service
Incident Report for SecureAuth Service
Postmortem

Incident Description

On the evening of 2018-04-01, SecureAuth customers began experiencing issues with MFA. Full availability of the impacted services was restored the next morning (2018-04-02) at approximately 10:25 AM EDT.

Root Cause

After a thorough analysis and review of our systems, we have determined that the following issues were the primary contributors to the failure:

  • SACloud has multiple backup processes running on the backend database, including full backups, differentials, and transaction logs. The various backups are performed by a combination of tools. At the time of the outage, the process for one of these backup tools was consuming abnormally high CPU, starving the database service of the CPU cycles it needed to return query results to the front-end SACloud web servers.
  • The immediate fix was to stop and disable that backup process. After consulting with our backup vendors, we learned that conflicts between that tool and other backup processes can cause it to consume high CPU. This is most likely what caused its sudden increase in CPU utilization.
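As an illustration only, the kind of check that points at a single process starving its neighbours can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the tooling used during the incident: the process names `backup_agent` and `db_service`, the function `top_cpu_consumer`, and all readings are hypothetical.

```python
from statistics import mean


def top_cpu_consumer(samples):
    """Return (process_name, average_cpu_percent) for the heaviest process.

    `samples` is a list of dicts mapping a process name to its CPU percent,
    one dict per sampling interval. A process missing from an interval is
    averaged only over the intervals in which it appears.
    """
    readings = {}
    for sample in samples:
        for name, cpu in sample.items():
            readings.setdefault(name, []).append(cpu)
    averages = {name: mean(values) for name, values in readings.items()}
    heaviest = max(averages, key=averages.get)
    return heaviest, averages[heaviest]


# Hypothetical readings for illustration; not data from the incident.
samples = [
    {"backup_agent": 91.0, "db_service": 6.0},
    {"backup_agent": 94.0, "db_service": 4.0},
    {"backup_agent": 89.0, "db_service": 8.0},
]

name, avg = top_cpu_consumer(samples)
print(f"{name} averaged {avg:.1f}% CPU")
```

Averaging over several intervals, rather than reacting to a single reading, helps separate a sustained hog from a short, harmless spike.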

Corrective Actions

The following measures are being taken to prevent an incident of this type from happening in the future:

  • The tool that was consuming high CPU remains disabled and SecureAuth is continuing to perform backups with other tools.
  • SecureAuth’s systems operations team is evaluating the right combination of tools, scheduling, resource limiting, and scripted backup methods to meet our data integrity and disaster recovery needs without creating conflicts like the one that caused this outage.
  • SecureAuth’s systems operations and development teams are investigating better ways to monitor and alert for similar conditions in the future. Internal health monitoring thresholds and escalation procedures will be adjusted and tested to ensure notifications and alerts are issued promptly.
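To make the idea of monitoring thresholds with escalation concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not SecureAuth's monitoring configuration: the class name `SustainedCpuAlert`, the 85% threshold, and the warn/page interval counts are all hypothetical values chosen for the example.

```python
class SustainedCpuAlert:
    """Escalate only when CPU stays above a threshold for several intervals."""

    def __init__(self, threshold=85.0, warn_after=3, page_after=6):
        # All defaults are hypothetical; real values would be tuned and tested.
        self.threshold = threshold
        self.warn_after = warn_after
        self.page_after = page_after
        self.breaches = 0  # consecutive intervals at or above the threshold

    def observe(self, cpu_percent):
        """Record one reading and return 'ok', 'warn', or 'page'."""
        if cpu_percent >= self.threshold:
            self.breaches += 1
        else:
            self.breaches = 0  # any healthy reading resets the streak
        if self.breaches >= self.page_after:
            return "page"
        if self.breaches >= self.warn_after:
            return "warn"
        return "ok"


# Hypothetical readings, one per monitoring interval.
alert = SustainedCpuAlert()
readings = [50, 90, 92, 95, 97, 99, 96, 40]
states = [alert.observe(r) for r in readings]
print(states)
```

Requiring consecutive breaches before warning, and more before paging, trades a short detection delay for fewer false alarms; the right counts depend on the sampling interval and on how quickly the service degrades.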
Posted Apr 06, 2018 - 19:16 UTC

Resolved
The issue has been resolved. Post-mortem to follow once investigation is complete.
Posted Apr 02, 2018 - 21:27 UTC
Monitoring
Performance issues on our US1 cloud service have been resolved. We are closely monitoring the situation.
Posted Apr 02, 2018 - 14:55 UTC
Investigating
We are currently investigating performance issues on our US1 cloud service. There may be delays or failures for SMS, Voice, Push, and other cloud-related functions.
Posted Apr 02, 2018 - 14:21 UTC