Lab explains Second Life’s weekend woes

We’re all used to Second Life misbehaving at the weekend, be it with rezzing or rendering or region crossings and so on. However, Saturday, January 9th, and Sunday, January 10th proved to be a lot rougher than most weekends in recent memory, with Sunday in particular affecting a lot of SL users.

When situations like this arise, it’s easy to shake a verbal fist at “the Lab” and bemoan the situation whilst forgetting we’re not the only ones being impacted. Issues and outages bring disruption to the Lab as well, and often aren’t as easy to resolve as we might think. This is why it is always good to hear back from the Lab when things do go topsy-turvy – and such is the case with the weekend of the 9th / 10th January.

Posting to the Tools and Technology blog on Monday, January 11th, April Linden, a member of the Operations Team (although she calls herself a “gridbun” on account of her purple bunny avatar), offered a concise explanation as to what happened from the perspective of someone at the sharp end of things.

April starts her account with a description of the first issue to hit the platform:

Shortly after midnight Pacific time on January 9th (Saturday) we had the master node of one of the central databases crash. The central database that happened to go down was one of the most used databases in Second Life. Without it Residents are unable to log in, or do, well, a lot of important things.

While the Lab is prepared for such issues, it does take time to deal with them (in this case around 90 minutes), with services having to be shut down and then restarted in a controlled manner so as not to overwhelm the affected database. This is why, when things like this do happen, we often see notices on the Grid Status Page warning us that log-ins may be suspended and / or to avoid carrying out certain activities.

Sadly, this wasn’t the end of matters; on Sunday an issue with one of the Lab’s providers had a major impact on in-world asset loading (while April doesn’t specifically point at which provider, I’m assuming from her description it may have been one of the CDN providers). While the Lab is versed in working with their providers to analyse the root cause of problems and rectify them, this particular issue appears to have had a knock-on effect in a quite unexpected way, impacting the avatar baking service.

This is the mechanism by which avatar appearances are managed and shared (and is also known as Server-Side Appearance and / or Server-Side Baking). Designed to overcome limitations with using the viewer / simulator to handle the process, it was cautiously deployed in 2013 after very extensive testing, and it has operated pretty reliably since its introduction. As such, the fact that it was so negatively impacted at the weekend appears to have caught the Lab off-guard, with April noting:

One of the things I like about my job is that Second Life is a totally unique and fun environment! (The infrastructure of a virtual world is amazing to me!) This is both good and bad. It’s good because we’re often challenged to come up with a solution to a problem that’s new and unique, but the flip side of this is that sometimes things can break in unexpected ways because we’re doing things that no one else does.

Taking this to be the case, it doubtless took the Lab a while to figure out how best to deal with the situation, which likely also contributed to the time taken for things to be rectified to the point where people weren’t being so massively impacted. Hopefully, what did occur at the weekend will help the Lab better assess circumstances where such problems – unique as they may be – occur, and determine courses of action to mitigate them in the future.

In the meantime, April’s post, like Landon Linden’s update on the extended issues of May 2014, helps remind us of just what a hugely complex beast of systems and services Second Life is, and how, even after 13 years of operations, it can still go wrong in ways that not only frustrate users, but also take the Lab by surprise, despite their best efforts. Kudos to April for presenting the explanation and for apologising for the situation. I hope she, together with all involved, has had time to catch up on their sleep!

6 thoughts on “Lab explains Second Life’s weekend woes”

  1. Solace Fairlady

    Thank you Miss Inara, your blog is my go-to for things newsy in SL, as you always explain things in a way I can understand, and you always get the right information to present in the first place as well:) I was lucky on Sunday, I was already inworld when the Grid started to wobble, so I was only affected in minor ways really (unable to attach objects etc), though thankfully I was already dressed and in a complete avatar as otherwise things might have felt a lot different to me:) I think the Lab handled the situation very well, and in fact got the issue resolved pretty quickly, I think the actual disturbances only lasted 2 hours at max and the peak was shorter still. However a lot of residents have all the patience we have come to expect from users of social media these days, nor do they possibly remember what it used to be like a few years ago:) Hats off to April for her post and explanation, it is reassuring to see the spirit of the early Lindens still alive and well:)

  2. Pixel

    We all know that these situations happen and they keep happening and after a while they get (more or less) fixed. The problem, instead, is again communication. LL may be caught by surprise and it is not always their own fault; but “residents” are caught by surprise both by the issues and by the subsequent LL unscheduled maintenance. How many people look at the Grid Status Page every day and at every hour, or even know that it exists at all? Some people hear about it thanks to the tam-tam in some group chats, but besides that LL says nothing and doesn’t warn their customers about activities that can lead to losses of paid items and transactions, thus causing real money losses to their customers. And they won’t refund you. Not to mention no-copy items with sentimental value (i.e. gifts, memories etc). Either LL finds a way to send grid-wide warnings (as some MMOs do) or at least the Viewer should check that there isn’t unscheduled maintenance going on when someone rezzes a no-copy item or buys something, and in that case it should show a message explaining what is going on, for example. I don’t think the people working in LL’s customer care team are happy either when customers are caught by surprise by unscheduled maintenance. Neither are the customers, when they are told “sorry we can’t do nothing to recover your items and we won’t refund you [for the damage we caused to you]”. Which sounds like they don’t even make backups of their database, or whatever the reason is that they can’t.

    1. Inara Pey Post author

      I did actually spend time under Rod Humble’s tenure campaigning for a return of the in-world notification for when Things Were Going Wrong – and this included several direct conversations with Rod himself. I certainly wasn’t alone in this – many others also wanted to see it back and blogged on the subject. Sadly, nothing ever came of it, other than various technical reasons why the notification couldn’t be re-implemented (and what I felt at the time was a terribly weak excuse for its non-return: it annoyed residents …).

      Maybe it is something to start poking Ebbe about again …
