Second Life’s August 23rd woes by April Linden

Tuesday, August 23rd was not a particularly good day for Second Life, with an extended period of unscheduled maintenance during which log-ins were suspended and those in-world were advised to refrain from rezzing No Copy objects, making any LindeX-related transactions, and so on.

At the time of the problem, there was speculation that it might be due to further issues with the central database node (and kudos to Caitlyn for suggesting this 🙂). Writing in a Tools and Technology blog post on August 24th, Operations Team lead April Linden confirmed this was in fact the case:

Shortly after 10:30am [PDT], the master node of one of the central databases crashed. This is the same type of crash we’ve experienced before, and we handled it in the same way. We shut down a lot of services (including logins) so we could bring services back up in an orderly manner, and then promptly selected a new master and promoted it up the chain. This took roughly an hour, as it usually does.
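
For readers unfamiliar with what promoting a new master involves, the sketch below illustrates, in very rough terms, the kind of failover logic an operations team might script for a conventional primary/replica database setup: suspend dependent services, pick the healthiest replica, promote it, and repoint the rest. It is purely illustrative – the Lab hasn't published its actual tooling, and the host names and helper functions here are invented.

```python
# Purely illustrative failover sketch (not Linden Lab's actual tooling),
# loosely mirroring the steps April describes: suspend dependent services,
# promote the best replica to master, then repoint the remaining replicas.

from dataclasses import dataclass


@dataclass
class Replica:
    host: str
    lag_seconds: float   # how far behind the failed master this replica is
    healthy: bool


def choose_new_master(replicas: list[Replica]) -> Replica:
    """Pick the healthy replica that is least behind the failed master."""
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy replica available for promotion")
    return min(candidates, key=lambda r: r.lag_seconds)


def failover(replicas: list[Replica]) -> Replica:
    # 1. Suspend dependent services (logins, LindeX, etc.) -- stubbed out here.
    # 2. Promote the least-lagged healthy replica.
    new_master = choose_new_master(replicas)
    # 3. Repoint the remaining replicas at the new master -- also stubbed out.
    return new_master


if __name__ == "__main__":
    fleet = [
        Replica("db-node-2", lag_seconds=1.5, healthy=True),
        Replica("db-node-3", lag_seconds=0.2, healthy=True),
        Replica("db-node-4", lag_seconds=0.0, healthy=False),
    ]
    print(f"Promoting {failover(fleet).host} to master")
```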

Given this has happened in the relatively recent past (see here and here), the Ops Team are getting pretty good at handling these situations. Except this time there was a slight wrinkle in the proceedings. The previous failures had occurred at times when concurrency was relatively low. This time, however, the problem hit when rather a lot of people were trying to get into SL. As April notes:

A few minutes before 11:30am [PDT] we started the process of restoring all services to the Grid. When we enabled logins, we did it in our usual method – turning on about half of the servers at once. Normally this works out as a throttle pretty well, but in this case, we were well into a very busy part of the day. Demand to login was very high, and the number of Residents trying to log in at once was more than the new master database node could handle.

Around noon we made the call to close off logins again and allow the system to cool off. While we were waiting for things to settle down we did some digging to try to figure out what was unique about this failure, and what we’ll need to do to prevent it next time.

It wasn’t actually until a third attempt was made to bring up the login hosts one at a time that things ran smoothly, with services being fully restored at around 2:30pm PDT.
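
Purely by way of illustration (again, the Lab hasn't described its login tooling, and the host names, batch sizes and pauses below are invented), the difference between the usual "half the servers at once" approach and the slower, one-at-a-time ramp used on the third attempt might look something like this:

```python
# Illustrative sketch only: re-enable login hosts in batches, pausing between
# batches so a freshly promoted master database isn't hit by the entire
# backlog of waiting Residents at once. Host names, batch sizes and delays
# are all made up for the example.

import time


def enable_logins(hosts: list[str], batch_size: int, pause_seconds: float) -> None:
    """Bring login hosts back in batches of batch_size, pausing between batches."""
    for i in range(0, len(hosts), batch_size):
        for host in hosts[i:i + batch_size]:
            print(f"enabling logins on {host}")  # real tooling would contact the host
        time.sleep(pause_seconds)                # let the central database absorb the surge


if __name__ == "__main__":
    login_hosts = [f"login-{n:02d}" for n in range(1, 13)]

    # Usual approach: roughly half of the fleet at once.
    # enable_logins(login_hosts, batch_size=len(login_hosts) // 2, pause_seconds=60)

    # The slower ramp: one host at a time, with a longer cool-off in between.
    enable_logins(login_hosts, batch_size=1, pause_seconds=300)
```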

Now, as April notes, she and her team have a new challenge to deal with: understanding why they had to turn the login servers back on much more slowly than in the past. There is, however, a bright spot in all this: the work put into making the Grid Status feed more resilient paid off, with the service appearing to cope with the load placed on it by several thousand people trying to discover what was going on.

None of us like it when things go wrong, but it’s impossible for SL to be all plain sailing. What is always useful is not only being kept informed about what is going on when things do get messed up (and don’t forget, if you’re on Twitter you can also get grid status updates there), but also being given the opportunity to understand why things went wrong after the fact.

In this respect, April’s blog posts are always most welcome, and continue to be an informative read, helping anyone who reads them appreciate just what a complicated beast Second Life is, and how hard the Lab actually works to keep it running smoothly for all of us – and to get on top of things as quickly as they can when something does go wrong.

6 thoughts on “Second Life’s August 23rd woes by April Linden”

  1. Wolf Baginski (@WolfBaginski)

    I check the blog report, and it reads almost as if nobody knows about timezones. Fair enough that all the times given are in PDT, but the initial crash was early evening in Europe, and it would be surprising if the log-in rate didn’t increase at that time. The Lab even published data on some of this stuff, though that stopped a long time ago. I recall Tatero Nino had some of the figures on her website.

    Then I look at what I did: essentially I logged on, found serious problems of the sort that are often blamed on a “bad connection”, with the advice to re-log, and then the log-in failed. And I went and found something else to do. The cry of “It’s your connection” is plausible, but I am feeling a little more wearily sceptical about it today. It’s an easy answer, maybe too easy.

    It’s all the standard warnings, and it does leave the lurking doubt whether the L$ totals, and the inventory, are correctly preserved. It is, I reckon, a bit early for the East Coast USA surge to start, but I can’t have been the only one wanting to check.

    And the Status page did overload. I saw it. So did people on Twitter.

    In the end, they did a good job, but I wonder if there need to be a few more clocks, set to different timezones, in what passes for an ops room. Europe, Japan, and East Coast come to mind as strong possibilities. Because they managed to sound a little bit too surprised.

  2. orcaflotta

    The “It’s your connection” explainification once was probably very true, BUT it’s 2016 now, ffs. I remember when I first logged into SL, in January of 2007, it was on a weak-ish laptop without any dedicated graphics card and on a DSL 128 connection. It didn’t even meet LL’s minimum requirements, and I spent my first half year in-world on a 64 meter DD, unable to go to clubs with many ppl in them. Of course I crashed, and it was my fault, or rather my connection’s. But, like everybody else’s, my ISP, the state-owned infrastructure and my computers became cheaper and better, and my connectivity nowadays is fast and stable, as is my dedicated SL machine. And I guess I’m not the only one who has learned and is better prepared to meet SL’s demands.
    So I guess they should spare us and themselves the disgraceful blame shifting and look into getting better hardware themselves.

    1. Inara Pey (post author)

      “So I guess they should spare us and themselves the disgraceful blame shifting…”

      April’s blog post doesn’t shift blame. Rather the reverse: it’s an open acknowledgement that something at their end went wrong and of what needed to be done to rectify things, coupled with a promise that they are attempting to understand what is failing, why, and how best to address it.

      “…and look into getting better hardware themselves.”

      One of the reasons why there has been a paucity of server deployments over the last few months is that the Lab has (again) been engaged on a series of infrastructure updates, which include hardware and software / operating systems, as I’ve commented on in my weekly SL project updates.

  3. Shug Maitland

    I have said this before: the SL Grid Status page is, quite understandably, slow to respond; there is no way the Lindens are going to post anything while they are still just awakening to the fact that there is a problem.
    http://etitsup.com/slstats/ is a running 24-hour curve of SL concurrency with a 5-minute update cycle. When things go sideways, that is where I check first; if the graph is dropping uncharacteristically, the problem is not me or my network connection, and I might as well just hold on and wait it out. There are very few catastrophic grid-wide issues that do not show up clearly on the etitsup graph.

  4. Taylor

    “There is, however, a bright spot in all this: the work put into making the Grid Status feed more resilient paid off, with the service appearing to cope with the load placed on it by several thousand people trying to discover what was going on.”

    The grid status page wasn’t loading for me (or friends) for quite a while when the issues first started. There was a page with a progress bar but it never got past that. I had to check the @SLGridStatus Twitter feed instead.

  5. Kyllein MacKellerann

    One almost gets the feeling that both the Central node and the login servers were overtaxed, causing them to fail. I also notice that when I’m in-world I get slow loading, almost as if I have a poor or slow connection (I don’t, and have the monthly bills to prove it). I wonder if maybe there is too much happening in SL for the system to handle in a timely manner? Consider: every update does more than the previous one. Every client update has more services attached (Firestorm is a prime example of this). I wonder if possibly all these “goodies” aren’t loading the servers, system nodes and primary nodes to the point of failure? Is it possible that the improvements being created are the source of the problems? Are both Second Life and the Client providers the actual source of, rather than the victims of, this problem? We’ll learn which is which as time goes by; if more goodies equals more problems, both SL and the Client providers may have to scale back to keep the experience going rather than crashing. Just because a processor can run at “OhMyGod” speeds doesn’t mean it will do so without problems. Simpler may wind up being better in the long run.
