Incident Description
At approximately 2125 UTC on May 22, internal monitoring alerted us to a problem with SMS message delivery. Logs showed delayed messages and a very small percentage of undelivered messages. We took corrective action, and normal operation resumed at about 2225 UTC, roughly one hour after the alert.
Root Cause
A detailed investigation with our network providers and others determined that an extreme spike in volume had exhausted resources on one of our servers, which caused SMS delivery delays and failures. The spike was traced to test traffic that had been incorrectly directed to a production server.
Corrective Actions
Extensive discussions have taken place with the relevant parties. In addition, we have trained our staff to identify traffic spikes originating from a single source IP and to activate rate-limiting tools correctly. We have also adjusted the parameters of our alerting system so that it reacts more quickly to excess volume and to delivery delays.
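
For illustration only, the kind of single-source-IP spike check described above could be sketched as follows. This is a minimal example and not our production tooling: the class name SourceSpikeDetector, the window length, the message threshold, and the sample IP address are all hypothetical values chosen for the example.

```python
from collections import defaultdict, deque


class SourceSpikeDetector:
    """Flags a source IP whose message count in a sliding window exceeds a limit.

    Hypothetical sketch: the name, window, and threshold are illustrative
    assumptions, not values taken from the incident.
    """

    def __init__(self, window_seconds: float, max_messages: int):
        self.window_seconds = window_seconds
        self.max_messages = max_messages
        # Maps each source IP to the timestamps of its recent messages.
        self._events = defaultdict(deque)

    def record(self, source_ip: str, timestamp: float) -> bool:
        """Record one message; return True if the source is over the limit."""
        events = self._events[source_ip]
        events.append(timestamp)
        # Drop timestamps that have aged out of the sliding window.
        while events and events[0] <= timestamp - self.window_seconds:
            events.popleft()
        return len(events) > self.max_messages


# Example: at most 3 messages per 10 seconds from any one source IP.
detector = SourceSpikeDetector(window_seconds=10.0, max_messages=3)
flags = [detector.record("198.51.100.7", t) for t in (0.0, 1.0, 2.0, 3.0, 20.0)]
print(flags)
```

A source that returns True would be a candidate for rate limiting or for an operator alert. Counts are kept per source IP, so heavy traffic from one address does not cause other senders to be flagged.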