Incident Description
At approximately 0920 on 5/10/2019, internal monitoring detected delayed or failed SMS delivery in our US1 data center. Investigation initially indicated an issue with one of our servers, which was removed from the load balancer pool and rebooted. After the server was added back into the load balancer pool, the delayed SMS delivery persisted. At approximately 0934 the issue self-resolved and normal operation was restored.
Root Cause
After extensive investigation with our network providers, it was determined that an issue with some of the data center equipment was causing inbound requests to queue up for periods of 10-15 seconds and then be delivered to the server within a few hundred ms. These bursts exceeded the server’s capacity to respond.
Corrective Actions
We have aggressively engaged with the network provider to resolve the issue. In addition, we are adding traffic gap monitors to our internal monitoring so that a recurrence of this or a similar issue is detected more rapidly. We are also updating our Level 1 response playbooks to move the check of traffic patterns and volumes earlier in the sequence of diagnostic steps.
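As an illustration only, a traffic gap check of the kind described above could be sketched as follows. This is a minimal sketch, not our production monitor: the function name `find_gap_bursts`, the `GapEvent` type, and the thresholds (a 10 second gap, a 0.5 second burst window, 50 requests) are assumptions chosen to match the pattern described under Root Cause. It flags any silence in inbound requests that is followed by a dense burst of arrivals.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class GapEvent:
    gap_start: float   # timestamp of the last request before the silence
    gap_end: float     # timestamp of the first request after the silence
    burst_count: int   # requests arriving within the burst window


def find_gap_bursts(
    arrivals: Iterable[float],
    min_gap_s: float = 10.0,       # assumed threshold: silence of 10 s or more
    burst_window_s: float = 0.5,   # assumed threshold: "a few hundred ms"
    min_burst: int = 50,           # assumed threshold: depends on normal volume
) -> List[GapEvent]:
    """Return every gap of at least min_gap_s that is followed by a burst."""
    ts = sorted(arrivals)
    events: List[GapEvent] = []
    for i in range(1, len(ts)):
        if ts[i] - ts[i - 1] < min_gap_s:
            continue
        # Count requests that arrive within the burst window after the gap.
        window_end = ts[i] + burst_window_s
        j = i
        while j < len(ts) and ts[j] <= window_end:
            j += 1
        burst = j - i
        if burst >= min_burst:
            events.append(GapEvent(ts[i - 1], ts[i], burst))
    return events
```

In practice the thresholds would need tuning against normal request volume, since a low-traffic period can produce long gaps without any underlying network fault.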