Investigating Intermittent Issues with Cloud Services

Incident Report for SecureAuth Service

Postmortem

Description of events:

On Wednesday, June 6, 2018, at 1:22 PM EDT, the SecureAuth Cloud monitoring systems alerted the engineering team to a geolocation database failure at the US1 data center. An investigation was initiated at that time and confirmed that intermittent timeouts were occurring on that database; the cause of the timeouts was under investigation. At 1:39 PM EDT the active database cluster node became unresponsive. Due to the nature of the node failure, the cluster fail-over manager could not complete the fail-over automatically, and manual intervention was required. A manual cluster fail-over was initiated at 12:45 PM EDT and completed at 12:58 PM EDT. Recovery was completed, normal operation was restored, and verification of services was complete at 2:04 PM EDT.

What was the issue or cause?

The database replication and web services timeouts leading up to the node failure were due to a performance issue with the iSCSI SAN. The active database node became unresponsive, and an automatic cluster fail-over to the secondary node was initiated but could not complete, requiring manual intervention.

What was done to correct the issue?

The failing database node was recycled and a manual cluster fail-over was completed, resolving the performance issue.

What is being done to ensure this type of failure is avoided going forward?

SAN resources have been reconfigured to address potential future performance issues related to the database cluster. The SAN replacement project currently underway has been prioritized and accelerated. The database nodes are being reliability tested, and appropriate corrective actions will be taken based on the results.

Posted Jun 08, 2018 - 13:53 PDT

Resolved

Cloud services issues have been resolved. Post-mortem to follow once the investigation is complete.

Posted Jun 06, 2018 - 11:03 PDT

Investigating

We are currently investigating intermittent issues with cloud services. Services that may currently be impacted are SMS, Telephony, Push, Location Services, and Certificate Enrollment. We will provide updates as we receive more information.

Posted Jun 06, 2018 - 10:51 PDT

This incident affected: SecureAuth Cloud Services (Enhanced Geolocation Resolution Service - US1, Enhanced Geolocation Resolution Service - US2, Geolocation Resolution Service - US1, Geolocation Resolution Service - US2, Nexmo Voice API, Push-to-Accept Service - US1, Push-to-Accept Service - US2, SMS Service - US1, SMS Service - US2, Telephony Extension/DTMF Service - US1, Telephony Extension/DTMF Service - US2, Telephony Provider SMS API, Telephony Service - US1, Telephony Service - US2, Threat Service - US1, Threat Service - US2, X.509 Certificate Service (SHA2) - US1, X.509 Certificate Service (SHA2) - US2).