Post Mortem on the Server Downtime on the 12th of August

On the 12th of August at 10:36 UTC, the game started to kick players out and did not let new players in. The problems lasted for 4 hours and 40 minutes, which is by far the longest downtime we’ve ever had. I thought it might make sense to write a bit about what happened. I’ll try to keep things at a high level, but some of it will be quite technical.

Background

We are using MongoDB as our database. We are running it in a so-called replica set, where most of the work is done on the primary database, and the changes are replicated to two secondaries for high availability. Some of the changes are so important that we want the primary to wait until the change has been acknowledged by at least one of the secondaries before continuing.
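
To make that last point concrete, here is a minimal sketch of what such an acknowledged write looks like, assuming a Python client with pymongo; the connection string, database, and field names are made up for illustration and are not our actual schema:

```python
# A rough sketch of an acknowledged write, assuming pymongo.
# The connection string, database, and field names are illustrative only.
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient(
    "mongodb://db1.example.com,db2.example.com,db3.example.com/?replicaSet=rs0"
)

# w=2 means the primary plus at least one secondary must confirm the write
# before the call returns.
players = client.game.get_collection(
    "players", write_concern=WriteConcern(w=2)
)

players.update_one({"_id": "player-123"}, {"$inc": {"gems": 100}})
```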

One of our engineers was in the process of updating an index on the collection holding the player data. Indexes are used to speed up queries targeting the collection. MongoDB does not support updating an index directly - instead, we create a new index and then drop the old one. Creating an index takes a long time, so we were doing it in the background. At 10:20 UTC the index creation completed on the primary database. At 10:36 UTC the engineer dropped the old index, and a few seconds later both of our secondary databases became unhealthy. The primary continued to work for a little while longer but eventually became unhealthy too.
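
For readers curious what that workflow looks like in practice, here is a sketch of the create-then-drop pattern, again assuming pymongo; the index and field names are hypothetical:

```python
# A sketch of the create-new-index-then-drop-the-old-one pattern, assuming pymongo.
# Index and field names are made up for illustration.
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://db1.example.com/?replicaSet=rs0")
players = client.game.players

# Build the replacement index in the background so normal queries keep running.
players.create_index(
    [("allianceId", ASCENDING), ("lastSeen", ASCENDING)],
    background=True,
    name="allianceId_1_lastSeen_1_v2",
)

# The old index should only be dropped once the new one exists everywhere.
players.drop_index("allianceId_1_lastSeen_1_v1")
```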

The Problem

What the engineer did not realize at the time was that the index creation starts on the secondaries only after the index has been created on the primary. Dropping an index completes immediately, so the old index was dropped from the secondaries while the creation of the new one had only just started. When both secondaries became unhealthy, none of the changes could be replicated anymore. Because the primary waits for an acknowledgement from a secondary for the most important changes, those operations were stuck forever and, eventually, the primary database ran out of resources.
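
A small sketch may help illustrate why the primary got stuck, once more assuming pymongo and hypothetical names. With both secondaries unhealthy, a write that asks for acknowledgement from a secondary can never complete; if no timeout is set it waits indefinitely, and those waiting operations are what eventually exhausted the primary:

```python
# Sketch of a write stuck waiting for a secondary, assuming pymongo.
# Names and values are illustrative.
from pymongo import MongoClient
from pymongo.errors import WTimeoutError
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://db1.example.com/?replicaSet=rs0")

# With both secondaries unhealthy, w=2 can never be satisfied. Adding a
# wtimeout makes the write fail instead of waiting forever.
players = client.game.get_collection(
    "players", write_concern=WriteConcern(w=2, wtimeout=5000)
)

try:
    players.update_one({"_id": "player-123"}, {"$set": {"lastSeen": "2019-08-12"}})
except WTimeoutError:
    # Without a wtimeout the operation would simply keep waiting, which is
    # how the stuck writes eventually exhausted the primary's resources.
    print("no secondary acknowledged the write in time")
```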

Over the next hours, we tried various things to restore the normal functionality of our replica set. Unfortunately, it started to look less and less likely to succeed. We could have recovered the database, but we were worried about potential data loss. While we worked to restore the system, the game was online periodically, but the user experience and stability were poor. In the end, we made the decision to restore the replica set from a backup. We take regular backups, but we were fortunate that the latest backup was dated 10:33 UTC, only 3 minutes before the incident occurred.

We re-opened the servers at 15:16 UTC.

Explanation

When the game was restored, we wanted to understand what had caused the problem. Our servers read some non-critical data from the secondary databases for performance reasons, so that the primary can focus on the critical work. While trying to recover the system, we worked on the assumption that the secondaries became unhealthy because they no longer had either of the indexes - one was still being created, and the other had just been dropped. On later investigation, it turned out that none of the queries that need the index were accessing the secondaries. So even though dropping the index too soon was a mistake, it should not have caused any visible issues in the game. Instead, we had encountered the following bug in MongoDB: https://jira.mongodb.org/browse/SERVER-21307.
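
For illustration, this is roughly how non-critical reads can be routed to the secondaries, sketched with pymongo; the collection name and query are illustrative, not the real ones:

```python
# A sketch of routing non-critical reads to the secondaries, assuming pymongo.
# The collection and query are illustrative.
from pymongo import MongoClient
from pymongo.read_preferences import ReadPreference

client = MongoClient("mongodb://db1.example.com/?replicaSet=rs0")

# Leaderboard-style queries tolerate slightly stale data, so they can be
# served by a secondary and keep load off the primary.
leaderboards = client.game.get_collection(
    "leaderboards", read_preference=ReadPreference.SECONDARY_PREFERRED
)

top_players = list(leaderboards.find().sort("trophies", -1).limit(100))
```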

Conclusion

We take downtime like this very seriously. The team already practices disaster recovery routinely to prepare us for this kind of situation, but that doesn’t mean we couldn’t do better. We held a post mortem on the incident and created almost 20 new action points based on the findings.

I can’t promise there won’t be similar outages in the future, but we will do our very best to learn from this one and make sure it will not happen again. Thank you for your understanding and your continued support!

229 Likes

Thank you for the open communication, it is appreciated :slightly_smiling_face:

26 Likes

Thank you for taking time to do a detailed post-mortem.

18 Likes

Many thanks @mhalttu and the rest of the team! From the sounds of things, it was indeed fortunate that the backup was only a short time before the issue started!

Keep up the awesome work & I appreciate everything that you all do behind the scenes to make this game as fun as it is for so many players :slight_smile:

27 Likes

:see_no_evil: Now we just use it as a cache, too many issues…

Many thanks for the explanation :wink:

2 Likes

As above, thanks for the explanations. These things happen and it’s good to have the transparency on the reasons. I don’t really understand it all (much like when the IT boffins explain this stuff at my work) but I do appreciate it when you do so anyways :slight_smile:

10 Likes

Thanks for the explanation, it’s good to know what caused the outage

3 Likes

Thank you, I am guessing a new standard operating procedure has been written for engineers :slight_smile: It is great to get the detailed explanation.

3 Likes

If I always got such detailed RCAs in my job, I would be happy.
Thank you @mhalttu

10 Likes

Thanks for the explanation @mhalttu. Most importantly, thank you for taking care of it in a timely and professional manner. You guys rock. :slight_smile:

12 Likes

Thank you for the explanation.
This is just the sort of communication that the player base is looking for from SG.
More communication can only help, and hopefully some of the lessons learned from this were not merely technical, but also communication related.

Thanks again @mhalttu

18 Likes

Thank you for taking the time to explain what happened @mhalttu

And please thank all the people who worked hard and fast for us

8 Likes

Thanks for the insights! A lot of technical details, but I appreciate it nevertheless.
Explaining such an issue is important. More communication is highly welcome.

3 Likes

Grateful for the clarifications. This is something that can happen to anyone; it comes with the trade. But taking the opportunity, I’d suggest the team review the drop percentage of the 4* ascension items, as this is making the game frustrating and discouraging.

2 Likes

Many thanks for the explanation. It is much appreciated to have this communication, especially one so clear and complete. Thanks for all your hard work @mhalttu and all the staff :+1:

5 Likes

Even triple redundancy isn’t always enough. :wink: :stuck_out_tongue_winking_eye:

The downtime made some of us remember real life. :rofl:

4 Likes

Why doesn’t this ring a bell? Give us reset tokens! I’m starving for 2

Lucky that the backup was just 3 minutes before the outage, otherwise the damage would have been very large…

2 Likes

Don’t understand server stuff, but the explanation is appreciated. :slightly_smiling_face:

6 Likes