On Wednesday, June 6, 2018, at 1:22 PM EDT, the SecureAuth Cloud monitoring systems alerted the engineering team to a geo-location database failure at the US1 data center. The investigation that began at that time confirmed intermittent timeouts on that database, and work started on identifying their cause. At 1:39 PM EDT the active database cluster node became unresponsive. Because of the nature of the node failure, the cluster fail-over manager could not complete the fail-over automatically, and manual intervention was required. A manual cluster fail-over was initiated at 1:45 PM EDT and completed at 1:58 PM EDT. Normal operation was restored, and verification of services was complete at 2:04 PM EDT.
The database replication and web services timeouts leading up to the node failure were caused by a performance issue with the iSCSI SAN. The active database node became unresponsive, and an automatic cluster fail-over to the secondary node was initiated but could not complete, requiring manual intervention.
The failing database node was recycled and a manual cluster fail-over was completed, resolving the performance issue.
SAN resources have been reconfigured to address potential future performance issues related to the database cluster. The SAN replacement project currently underway has been prioritized and accelerated. The database nodes are undergoing reliability testing, and appropriate actions will be taken based on the results.