Investigating Intermittent Issues with Cloud Services
Incident Report for SecureAuth Service
Postmortem
Description of events:

On Wednesday, June 6, 2018, at 1:22 PM EDT, the SecureAuth Cloud monitoring systems alerted the engineering team to a Geo-location database failure at the US1 data center. An investigation began at that time and confirmed intermittent timeouts on that database; the cause of the timeouts was still under investigation. At 1:39 PM EDT, the active database cluster node became unresponsive. Due to the nature of the node failure, the cluster fail-over manager could not complete the fail-over automatically, and manual intervention was required. A manual cluster fail-over was initiated at 12:45 PM EDT and completed at 12:58 PM EDT. Normal operation was restored, and verification of services was complete at 2:04 PM EDT.

What was the issue or cause?

The database replication and web service timeouts leading up to the node failure were caused by a performance issue with the iSCSI SAN. The active database node became unresponsive, and an automatic cluster fail-over to the secondary node was initiated but could not complete, requiring manual intervention.

What was done to correct the issue?

The failing database node was recycled and a manual cluster fail-over was completed, resolving the performance issue.

What is being done to ensure this type of failure is avoided going forward?

SAN resources have been reconfigured to address potential future performance issues related to the database cluster. The SAN replacement project currently underway has been prioritized and accelerated. The database nodes are undergoing reliability testing, and appropriate actions will be taken based on the results.

Posted Jun 08, 2018 - 20:53 UTC

Resolved
Cloud services issues have been resolved. Post-mortem to follow once the investigation is complete.
Posted Jun 06, 2018 - 18:03 UTC
Investigating
We are currently investigating intermittent issues with cloud services. Current services that may be impacted are SMS, Telephony, Push, Location Services, and Certificate Enrollment. We will provide updates as we receive more information.
Posted Jun 06, 2018 - 17:51 UTC
This incident affected: Enhanced Geolocation Resolution Service (SecureAuth US East Datacenter, SecureAuth US West Datacenter), SMS Service (SecureAuth US East Datacenter, SecureAuth US West Datacenter, Nexmo SMS API), Telephony Service (SecureAuth US East Datacenter, SecureAuth US West Datacenter, Nexmo Voice API), Geolocation Resolution Service (SecureAuth US East Datacenter, SecureAuth US West Datacenter), Push-to-Accept (Login Requests)/Push Notification Service (SecureAuth US East Datacenter, SecureAuth US West Datacenter), Telephony Extension/DTMF Service (SecureAuth US East Datacenter, SecureAuth US West Datacenter), SecureAuth Threat Service (SecureAuth US East Datacenter, SecureAuth US West Datacenter), and X.509 Certificate Service (SHA2) (SecureAuth US East Datacenter, SecureAuth US West Datacenter).