Investigating Intermittent Issues with Cloud Services
Incident Report for SecureAuth Service
Postmortem

Incident Description

On 2017-12-17 at 11:13 CST, SecureAuth customers began experiencing issues with delivery of SMS and TTS OTP messages. Full availability of the impacted services was restored at 13:16 CST.

Root Cause

After a thorough analysis and review of logs, we have determined that the following issues were the primary contributors to the failure:

• At 11:13 CST, a database cluster failure occurred: a disk capacity limitation caused the database state to change to read-only. Once identified, the disk issue was corrected, which restored write function to the database. Full availability of impacted services was restored at 13:16 CST.
• The health monitoring system for the SQL cluster captured the cluster failure and disk capacity warnings but failed to dispatch an alert to the SecureAuth infrastructure team in time to resolve the issue or reduce impact on services.

Corrective Actions

The following measures are being taken to prevent an incident of this type from happening in the future:
• A support case has been opened with the vendor of the health monitoring software to determine why the product failed to issue notifications and alerts for this database cluster.
• Datacenter support services and SecureAuth’s internal health monitoring thresholds and escalation procedures are being adjusted and tested to ensure notifications and alerts are issued expeditiously.
• Additional health monitoring service alerts are being configured with a tertiary service provider.
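
For illustration only, the threshold and escalation adjustments listed above can be sketched as a disk capacity check with a fallback alert path. This is a minimal sketch under stated assumptions, not SecureAuth's actual monitoring configuration: the threshold values, the function names, and the notifier callables are all hypothetical.

```python
import shutil
from typing import Callable, Sequence

# Hypothetical thresholds; real values depend on the environment.
WARN_PCT = 80.0
CRIT_PCT = 90.0


def disk_used_pct(path: str) -> float:
    """Return the percentage of disk space used on the volume holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify_usage(used_pct: float,
                   warn: float = WARN_PCT,
                   crit: float = CRIT_PCT) -> str:
    """Map a usage percentage to 'ok', 'warning', or 'critical'."""
    if used_pct >= crit:
        return "critical"
    if used_pct >= warn:
        return "warning"
    return "ok"


def dispatch_alert(message: str,
                   notifiers: Sequence[Callable[[str], None]]) -> int:
    """Try each notifier in order until one succeeds.

    Returns the index of the notifier that delivered the alert,
    or -1 if every notifier raised an exception.
    """
    for index, notify in enumerate(notifiers):
        try:
            notify(message)
            return index
        except Exception:
            # This channel failed; fall through to the next provider.
            continue
    return -1
```

In this sketch, the ordered notifier list stands in for a primary alerting channel followed by an independent backup provider, so that a failure in one alerting path does not suppress the notification entirely.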

Posted Dec 18, 2017 - 15:25 PST

Resolved
This incident has been resolved.
Posted Dec 17, 2017 - 15:37 PST
Monitoring
The issue is now resolved. We will continue to monitor our services to ensure there are no further issues. A root cause analysis will be provided.
Posted Dec 17, 2017 - 11:30 PST
Investigating
We are currently investigating intermittent issues with our cloud services. SMS, Voice, Push, and Certificate Enrollment services may be affected at this time. Our team is working to resolve the issue, and an update will be posted as more information is available.
Posted Dec 17, 2017 - 11:20 PST