Intermittent Issues with US East Cloud Services
Incident Report for SecureAuth Service
Postmortem

Incident Description

At 12:15 UTC on September 8, 2018, intermittent failures in the Push Notifications, Risk, and Geolocation services were reported.  Investigation showed that the main database server in our primary data center's database cluster was at 100% CPU, which caused the failures.  Impact was very limited, affecting only a few customers.

After the problem was identified, impacted customers were redirected to our secondary data center, which was unaffected by the issue.  CPU usage returned to normal at 13:00 UTC.  After internal monitoring determined that the systems were stable again, impacted customers were failed back to our primary data center.

Root Cause

The primary database server in the database cluster at our primary data center suffered an extended CPU spike caused by an unexpected internal database maintenance job.  As a result, many requests reached internal service timeouts, which impacted the Push Notifications, Risk, and Geolocation services.

Corrective Actions

Enhanced monitoring and alerting have been implemented.  Priority has been given to providing additional resources to the database cluster.

Posted about 2 months ago. Oct 01, 2018 - 19:53 UTC

Resolved
This incident has been resolved. RCA will be posted in the next few days.
Posted 2 months ago. Sep 09, 2018 - 02:51 UTC
Monitoring
All endpoints are currently working. Service was restored at approximately 08:20 AM Pacific. We will continue to monitor and will update this with an RCA once the investigation is complete.
Posted 2 months ago. Sep 08, 2018 - 16:45 UTC
Investigating
We are currently investigating intermittent issues with cloud services. Current services that may be impacted are SMS, Telephony, Push, Location Services, and Certificate Enrollment. We will provide updates as we receive more information.
Posted 2 months ago. Sep 08, 2018 - 14:00 UTC
This incident affected: Telephony Extension/DTMF Service (SecureAuth US East Datacenter, SecureAuth US West Datacenter), Telephony Service (SecureAuth US East Datacenter, SecureAuth US West Datacenter, Nexmo Voice API), X.509 Certificate Service (SHA2) (SecureAuth US East Datacenter, SecureAuth US West Datacenter), Enhanced Geolocation Resolution Service (SecureAuth US East Datacenter, SecureAuth US West Datacenter), Geolocation Resolution Service (SecureAuth US East Datacenter, SecureAuth US West Datacenter), Push-to-Accept (Login Requests)/Push Notification Service (SecureAuth US East Datacenter, SecureAuth US West Datacenter), SecureAuth Threat Service (SecureAuth US East Datacenter, SecureAuth US West Datacenter), and SMS Service (SecureAuth US East Datacenter, SecureAuth US West Datacenter, Nexmo SMS API).