On 12 August at 10:36 UTC, the game started kicking players out and stopped letting new players in. The problems lasted 4 hours and 40 minutes, by far the longest downtime we’ve ever had. I thought it might make sense to write a bit about what happened. I’ll try to keep things at a high level, but some of it will be quite technical.
We are using MongoDB as our database. We are running it in a so-called replica set, where most of the work is done on the primary database, and the changes are replicated to two secondaries for high availability. Some of the changes are so important that we want the primary to wait until the change has been acknowledged by at least one of the secondaries before continuing.
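To make that acknowledgement behaviour concrete, here is a toy sketch in plain Python (this is not MongoDB’s actual implementation, just a simplified model): a “primary” that blocks an important write until a “secondary” confirms replication.

```python
# Toy model of acknowledged writes (not MongoDB internals): the primary
# refuses to complete an important write until at least one secondary
# has acknowledged it, modelled here with a threading.Event.
import threading

class ToyPrimary:
    def __init__(self):
        self.acked = threading.Event()   # set by a secondary on replication

    def secondary_ack(self):
        self.acked.set()

    def important_write(self, timeout):
        # Block until some secondary acknowledges, or give up after timeout.
        return self.acked.wait(timeout)

# Healthy case: a secondary acknowledges shortly after, so the write completes.
primary = ToyPrimary()
threading.Timer(0.01, primary.secondary_ack).start()
print(primary.important_write(timeout=1.0))   # True

# Unhealthy case: no secondary ever acknowledges, so the write only
# returns because of the timeout we gave it here.
stuck = ToyPrimary()
print(stuck.important_write(timeout=0.05))    # False
```

With no healthy secondaries and no timeout, the wait in the second case would never return, which is essentially what happened to our primary later in this story.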
One of our engineers was in the process of updating an index on the collection holding the player data. Indexes are used to speed up queries targeting the collection. MongoDB does not support updating an index directly - instead, we create a new index and then drop the old one. Creating an index takes a long time, so we were doing it in the background. At 10:20 UTC the index creation completed on the primary database. At 10:36 UTC the engineer dropped the old index, and a few seconds later both of our secondary databases became unhealthy. The primary continued to work for a little while longer but eventually became unhealthy too.
What the engineer did not realize at the time was that index creation starts on the secondaries only after the index has been fully built on the primary. Dropping an index, on the other hand, completes immediately, so the old index was dropped from the secondaries while the build of the new one had only just started. Once both secondaries became unhealthy, none of the changes could be replicated anymore. Because the primary waits for an acknowledgement from a secondary for the most important changes, those operations were stuck forever, and eventually the primary database ran out of resources.
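The timing problem can be illustrated with another toy model (again plain Python, not MongoDB’s real replication logic, and the index names are made up): builds replicate only after finishing on the primary, while drops replicate immediately, so there is a window in which a secondary has neither index.

```python
# Toy timeline of the incident (not real MongoDB): each node tracks
# which indexes are ready and which are still being built.

class ToyNode:
    def __init__(self):
        self.ready = {"old_index"}    # every node starts with the old index
        self.building = set()

    def start_build(self, name):
        self.building.add(name)

    def finish_build(self, name):
        self.building.discard(name)
        self.ready.add(name)

    def drop(self, name):
        self.ready.discard(name)

primary, secondary = ToyNode(), ToyNode()

primary.start_build("new_index")
primary.finish_build("new_index")   # 10:20 UTC: build completes on the primary
secondary.start_build("new_index")  # only now does the secondary start building

primary.drop("old_index")           # 10:36 UTC: engineer drops the old index
secondary.drop("old_index")         # drops replicate immediately

print(primary.ready)                # {'new_index'} - the primary is fine
print(secondary.ready)              # set() - the secondary has no usable index
```

The last line is the dangerous window: the secondary is still building the new index but has already lost the old one.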
Over the next few hours, we tried various things to restore the normal functionality of our replica set. Unfortunately, it looked less and less likely that we would succeed. There were ways to recover the database, but we were worried about potential data loss. While we worked to restore the system, the game was online periodically, but the user experience and stability were poor. In the end, we made the decision to restore the replica set from a backup. We take regular backups, and we were fortunate that the latest one was dated 10:33 UTC, only 3 minutes before the incident occurred.
We re-opened the servers at 15:16 UTC.
When the game was restored, we wanted to understand what had caused the problem. Our server reads some non-critical data from the secondary databases for performance reasons, so that the primary can focus on more critical work. While trying to recover the server, we worked under the assumption that the secondaries had become unhealthy because they no longer had either of the indexes - one was still being created, and the other one had just been deleted. On later investigation, it turned out that none of the queries that need the index ever touch the secondaries. So even though dropping the index too soon was a mistake, it should not have caused any visible issues in the game. Instead, we had encountered the following bug in MongoDB: https://jira.mongodb.org/browse/SERVER-21307.
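The read-routing idea mentioned above can be sketched as follows (purely illustrative; in practice this is handled by the database driver’s read-preference settings, not hand-written code, and the query names are made up):

```python
# Illustrative routing of queries in a replica set: critical queries
# always go to the primary, non-critical ones are offloaded to a
# secondary when the secondaries are healthy.

def route(query_kind, secondaries_healthy=True):
    """Pick a target node for a query; the kinds here are hypothetical."""
    if query_kind == "critical" or not secondaries_healthy:
        return "primary"
    return "secondary"

print(route("critical"))                                # primary
print(route("leaderboard"))                             # secondary
print(route("leaderboard", secondaries_healthy=False))  # primary
```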
We take downtime like this very seriously. The team already practices disaster recovery routinely to prepare for this kind of situation, but that doesn’t mean we couldn’t do better. We held a post-mortem on the incident and created almost 20 new action points based on the findings.
I can’t promise there won’t be similar outages in the future, but we will do our very best to learn from this one and make sure it will not happen again. Thank you for your understanding and your continued support!