Tuesday, August 23rd was not a particularly good day for Second Life, which saw an extended period of unscheduled maintenance during which log-ins were suspended and those in-world were advised to refrain from rezzing No Copy objects, making any LindeX-related transactions, and so on.
At the time of the problem, there was speculation that it might be due to further issues with the central database node (and kudos to Caitlyn for suggesting this 🙂 ). Writing in a Tools and Technology blog post on August 24th, Operations Team lead April Linden confirmed this was in fact the case:
Shortly after 10:30am [PDT], the master node of one of the central databases crashed. This is the same type of crash we’ve experienced before, and we handled it in the same way. We shut down a lot of services (including logins) so we could bring services back up in an orderly manner, and then promptly selected a new master and promoted it up the chain. This took roughly an hour, as it usually does.
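For those curious what that pattern looks like in practice, here is a very rough, purely illustrative sketch of the general recovery shape April describes (suspend dependent services, promote the most caught-up replica, re-point the rest, restore services in order). The host names, replication positions and print-style "actions" are all made up for illustration; this is not the Lab's actual tooling.

```python
# Hypothetical replicas of the crashed master, with how far each has replicated.
replicas = [
    {"host": "db-replica-1", "replication_position": 10_452_118},
    {"host": "db-replica-2", "replication_position": 10_452_301},  # most caught up
    {"host": "db-replica-3", "replication_position": 10_451_990},
]

def suspend(service):
    print(f"suspending {service}")

def restore(service):
    print(f"restoring {service}")

# 1. Stop new work from reaching the database tier.
for service in ("logins", "LindeX transactions"):
    suspend(service)

# 2. Promote the replica with the most replication progress to be the new master.
new_master = max(replicas, key=lambda r: r["replication_position"])
print(f"promoting {new_master['host']} to master")

# 3. Point the remaining replicas at the new master so the chain is whole again.
for r in replicas:
    if r is not new_master:
        print(f"re-pointing {r['host']} at {new_master['host']}")

# 4. Bring services back up in an orderly manner.
for service in ("LindeX transactions", "logins"):
    restore(service)
```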
Given this has happened in the relatively recent past (see here and here), the Ops Team are getting pretty good at handling these situations. Except this time there was a slight wrinkle in the proceedings. The previous failures had occurred at times of day when concurrency was relatively low. This time, however, the problem hit when rather a lot of people were trying to get into SL, so as April notes:
A few minutes before 11:30am [PDT] we started the process of restoring all services to the Grid. When we enabled logins, we did it in our usual method – turning on about half of the servers at once. Normally this works out as a throttle pretty well, but in this case, we were well into a very busy part of the day. Demand to login was very high, and the number of Residents trying to log in at once was more than the new master database node could handle.
Around noon we made the call to close off logins again and allow the system to cool off. While we were waiting for things to settle down we did some digging to try to figure out what was unique about this failure, and what we’ll need to do to prevent it next time.
It wasn’t actually until a third attempt was made to bring up the login hosts one at a time that things ran smoothly, with services being fully restored at around 2:30pm PDT.
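To make the difference between the two approaches a little more concrete, here is a hedged sketch of re-enabling login hosts in batches: roughly half the fleet at once (the usual throttle) versus one host at a time with a check in between. The host names, health check and pacing are all my own invention for the sake of the example, not how the Lab actually does it.

```python
import time

LOGIN_HOSTS = [f"login-{i:02d}" for i in range(1, 13)]   # 12 imaginary hosts

def db_master_is_healthy():
    # Placeholder: in reality this would watch the new master's load.
    return True

def enable(host):
    print(f"enabling {host}")

def bring_up(hosts, batch_size, pause_seconds=0):
    """Enable login hosts in batches, checking the database stays healthy."""
    for i in range(0, len(hosts), batch_size):
        for host in hosts[i:i + batch_size]:
            enable(host)
        if not db_master_is_healthy():
            print("new master struggling; stopping to let things cool off")
            return False
        time.sleep(pause_seconds)
    return True

# Usual approach: about half the fleet at once.
bring_up(LOGIN_HOSTS, batch_size=len(LOGIN_HOSTS) // 2)

# What eventually worked on August 23rd: one host at a time.
bring_up(LOGIN_HOSTS, batch_size=1)
```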
Now, as April notes, she and her team have a new challenge to deal with: understanding why they had to turn the login servers back on much more slowly than in the past. There is, however, a bright spot in all this: the work put into making the Grid Status feed more resilient paid off, with the service appearing to cope with the load placed on it by several thousand people trying to discover what was going on.
None of us like it when things go wrong, but it’s impossible for SL to be all plain sailing. What is always useful is not only being kept informed about what is going on when things do get messed up (and don’t forget, if you’re on Twitter you can also get grid status updates there as well), but also being given the opportunity to understand why things went wrong after the fact.
In this respect, April’s blog posts are always most welcome, and continue to be an informative read, helping anyone who reads them appreciate just what a complicated beast Second Life is, and how hard the Lab actually works to try to keep it running smoothly for all of us – and to get on top of things as quickly as they can when something does go wrong.