Incident Description
On 2017-12-17 at 11:13 CST, SecureAuth customers began experiencing issues with delivery of SMS and TTS OTP messages. Full availability of the impacted services was restored at 13:16 CST.
Root Cause
After a thorough analysis and review of the logs, we determined that the following issues were the primary contributors to the failure:
• At 11:13 CST, a database cluster failure occurred, leaving the database in a read-only state because of a disk capacity limitation. Once the disk issue was identified, it was corrected, which restored write functionality to the database. Full availability of the impacted services was restored at 13:16 CST.
• The health monitoring system for the SQL cluster captured the cluster failure and disk capacity warnings but failed to dispatch an alert to the SecureAuth infrastructure team in time to resolve the issue or reduce impact on services.
Corrective Actions
The following measures are being taken to prevent an incident of this type from happening in the future:
• A support case has been opened with the vendor of the health monitoring software to determine why the product failed to issue notifications and alerts for this database cluster.
• Thresholds and escalation procedures for datacenter support services and SecureAuth’s internal health monitoring are being adjusted and tested to ensure that notifications and alerts are issued promptly.
• Additional health monitoring service alerts are being configured with a tertiary service provider.
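
For illustration only, the kind of disk capacity threshold check and redundant alert dispatch described above can be sketched as follows. This is a minimal sketch under stated assumptions, not SecureAuth's or any vendor's actual monitoring configuration: the 80% and 90% thresholds, the function names, and the channel names are all hypothetical.

```python
import shutil
from typing import Callable, List, Tuple


def classify_usage(used_fraction: float,
                   warn_at: float = 0.80,
                   critical_at: float = 0.90) -> str:
    """Map a used-space fraction (0.0 to 1.0) to an alert level.

    The default thresholds are illustrative assumptions.
    """
    if used_fraction >= critical_at:
        return "critical"
    if used_fraction >= warn_at:
        return "warning"
    return "ok"


def check_disk(path: str) -> Tuple[float, str]:
    """Return (used_fraction, level) for the volume containing `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_fraction = usage.used / usage.total if usage.total else 0.0
    return used_fraction, classify_usage(used_fraction)


def dispatch(message: str,
             channels: List[Tuple[str, Callable[[str], None]]]) -> List[str]:
    """Send `message` through every channel, not just the first one.

    Returns the names of the channels that delivered, so a failure in
    one alerting path does not silence the others.
    """
    delivered = []
    for name, send in channels:
        try:
            send(message)
            delivered.append(name)
        except Exception:
            # A failed channel is skipped; the remaining channels still run.
            continue
    return delivered


if __name__ == "__main__":
    fraction, level = check_disk(".")
    if level != "ok":
        # Hypothetical channels; real ones would page or email a team.
        channels = [("primary", print), ("secondary", print)]
        dispatch(f"Disk usage {fraction:.0%} ({level})", channels)
```

Sending each alert through every configured channel, rather than only falling back when one fails visibly, is a deliberate choice in this sketch: it covers the case where a single monitoring product records a warning but never delivers it.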