Incident Description
At 12:15 UTC on 9/8/2018, intermittent failures in the Push Notifications, Risk, and Geolocation services were reported. Investigation showed that the primary database server in the database cluster of our primary data center was at 100% CPU, which caused these failures. The impact was very limited, affecting only a few customers.
Once the problem was identified, impacted customers were redirected to our secondary data center, which was unaffected by the issue. CPU usage returned to normal at 13:00 UTC. After internal monitoring confirmed that the systems were stable again, impacted customers were failed back to our primary data center.
Root Cause
The primary database server in the database cluster of our primary data center suffered an extended CPU spike caused by an unexpected internal database maintenance job. While the CPU was saturated, many requests exceeded internal service timeouts, which impacted the Push Notifications, Risk, and Geolocation services.
Corrective Actions
Enhanced monitoring and alerting have been implemented. Providing additional resources to the database cluster has been prioritized.