Incident Description
On 2017-07-19, from 9:29 PM until 10:31 PM PDT, SecureAuth customers experienced intermittent delivery of SMS OTP messages. SecureAuth cloud monitoring alerted the engineering team to SMS service failures, and an investigation was initiated. It was determined that communication failures were occurring with the primary SMS and telephony provider's API. Nexmo acknowledged intermittent connectivity issues due to a failure at one of their US data centers. The SecureAuth Engineering team continued to monitor SMS and telephony services after the initial communication issue was resolved, but determined that the provider's services were still intermittently delayed; therefore, all SMS traffic was routed to our secondary provider. Fail-over to the secondary provider, which stabilized delivery of SMS messages, was completed at 10:31 PM PDT.
Root Cause
After a thorough analysis and review of logs, we have determined that the following issue was the primary contributor to the failure:
Nexmo experienced a network hardware failure in one of their US data centers, which initially caused intermittent communication failures with their API infrastructure and intermittent delivery delays of SMS and telephony messages. See the full description here: https://www.nexmostatus.com/incidents/gk9nmqv859z6
Corrective Actions
The following measures are being taken to prevent an incident of this type from happening in the future:
The official RCA from Nexmo was received on 7/28 and reviewed. SecureAuth Engineering met with the Nexmo Support and Operations teams on 8/1 and 8/2 to discuss outstanding questions and to review short- and long-term remediation plans.
- The affected hardware issue was resolved immediately, and fail-over monitoring was reviewed and improved.
- Over the next 30-90 days, additional improvements are planned around failure detection and fail-over automation efficacy.
- SecureAuth and Nexmo will convene regular cadence calls in support of our strategic partnership.
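For illustration only, the kind of failure detection and automated fail-over referred to above could be sketched as follows. This is a minimal, hypothetical example and not SecureAuth's or Nexmo's actual implementation: the class name FailoverRouter, the provider names, the window size, and the failure-rate threshold are all assumptions chosen for the sketch.

```python
from collections import deque


class FailoverRouter:
    """Hypothetical sketch: send via the first healthy provider, in priority order."""

    def __init__(self, providers, window=20, max_failure_rate=0.5):
        # providers: ordered list of (name, send_fn); send_fn raises on failure.
        self.providers = providers
        self.window = window
        self.max_failure_rate = max_failure_rate
        # Per-provider sliding window of recent outcomes (True = success).
        self.history = {name: deque(maxlen=window) for name, _ in providers}

    def healthy(self, name):
        results = self.history[name]
        if len(results) < self.window:
            return True  # too few samples to judge; assume healthy
        failure_rate = results.count(False) / len(results)
        return failure_rate <= self.max_failure_rate

    def send(self, number, text):
        # Prefer healthy providers; if none look healthy, try them all anyway.
        candidates = [p for p in self.providers if self.healthy(p[0])]
        for name, send_fn in candidates or self.providers:
            try:
                send_fn(number, text)
            except Exception:
                self.history[name].append(False)
                continue  # per-message fail-over to the next provider
            self.history[name].append(True)
            return name
        raise RuntimeError("all providers failed")
```

In this sketch, a message that fails on the primary is retried on the secondary immediately, and once the primary's recent failure rate exceeds the threshold it is skipped entirely. A production system would also need a recovery probe to route traffic back to the primary once it is stable, which is omitted here.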