Lab explains Second Life’s weekend woes

We’re all used to Second Life misbehaving itself at the weekend, be it with rezzing or rendering or region crossings and so on. However, Saturday, January 9th, and Sunday, January 10th, proved to be a lot rougher than most weekends in recent memory, with Sunday in particular affecting a lot of SL users.

When situations like this arise, it’s easy to shake a verbal fist at “the Lab” and bemoan the situation whilst forgetting we’re not the only ones being impacted. Issues and outages bring disruption to the Lab as well, and often aren’t as easy to resolve as we might think. Hence it is always good to hear back from the Lab when things do go topsy-turvy – and such is the case with the weekend of January 9th / 10th.

Posting to the Tools and Technology blog on Monday, January 11th, April Linden, a member of the Operations Team (although she calls herself a “gridbun” on account of her purple bunny avatar), offered a concise explanation as to what happened from the perspective of someone at the sharp end of things.

April starts her account with a description of the first issue to hit the platform:

Shortly after midnight Pacific time on January 9th (Saturday) we had the master node of one of the central databases crash. The central database that happened to go down was one of the most used databases in Second Life. Without it Residents are unable to log in, or do, well, a lot of important things.

While the Lab is prepared for such issues, it does take time to deal with them (in this case around 90 minutes), with services having to be shut down and then restarted in a controlled manner so as not to overwhelm the affected database. Hence, when things like this do happen, we often see notices on the Grid Status Page warning us that log-ins may be suspended and/or to avoid carrying out certain activities.
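To illustrate the general shape of that kind of controlled restart – and this is purely my own sketch, as the Lab hasn’t published its actual procedure – the idea is to readmit traffic at a rate that ramps up over time, letting the recovering database warm its caches before it takes the full backlog. Every name and number below is hypothetical:

```python
import time

def readmit_logins(pending, start_rate=4, max_rate=64, window=2.0):
    """Drain a backlog of queued logins at a gradually increasing rate,
    so a freshly restarted database isn't hit with the whole backlog at
    once. All names and values here are illustrative, not Linden Lab's."""
    rate = start_rate
    while pending:
        deadline = time.time() + window
        interval = window / rate                  # gap between admissions
        while pending and time.time() < deadline:
            print(f"admitting {pending.pop(0)} at {rate} per window")
            time.sleep(interval)
        rate = min(max_rate, rate * 2)            # widen the gate each window

readmit_logins([f"resident-{i}" for i in range(12)])
```

Ramping the gate like this avoids the “thundering herd” effect, where every queued client reconnecting at once could knock the database straight back over.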

Sadly, this wasn’t the end of matters; on Sunday an issue with one of the Lab’s providers had a major impact on in-world asset loading (while April doesn’t specifically say which provider, I’m assuming from her description it may have been one of the CDN providers). While the Lab is well versed in working with its providers to analyse the root cause of problems and rectify them, this particular issue appears to have had a knock-on effect in quite an unexpected way, impacting the avatar baking service.

This is the mechanism by which avatar appearances are managed and shared (and is also known as Server-Side Appearance and/or Server-Side Baking; see the sketch below April’s comment for the general idea). Designed to overcome limitations with using the viewer / simulator to handle the process, it was cautiously deployed in 2013 after very extensive testing, and has operated pretty reliably since its introduction. As such, the fact that it was so negatively impacted at the weekend appears to have caught the Lab off-guard, with April noting:

One of the things I like about my job is that Second Life is a totally unique and fun environment! (The infrastructure of a virtual world is amazing to me!) This is both good and bad. It’s good because we’re often challenged to come up with a solution to a problem that’s new and unique, but the flip side of this is that sometimes things can break in unexpected ways because we’re doing things that no one else does.
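To make the baking idea a little more concrete, here is a toy sketch of the kind of compositing a baking service performs: flattening a stack of appearance layers (skin, tattoo, clothing and so on) into a single finished texture. This is my own illustration using Python and Pillow, not the Lab’s actual pipeline, and the layers and sizes are entirely made up:

```python
from PIL import Image, ImageDraw

def bake_region(layers):
    """Composite stacked appearance layers (skin, tattoo, clothing, ...)
    into one 'baked' texture for a body region -- a toy illustration of
    the idea behind server-side baking, not Linden Lab's actual service."""
    baked = Image.new("RGBA", layers[0].size, (0, 0, 0, 0))
    for layer in layers:                      # bottom layer first
        baked = Image.alpha_composite(baked, layer.convert("RGBA"))
    return baked

# Stand-in layers: a 'skin' base with a 'shirt' drawn over its lower half.
skin = Image.new("RGBA", (256, 256), (224, 172, 105, 255))
shirt = Image.new("RGBA", (256, 256), (0, 0, 0, 0))
ImageDraw.Draw(shirt).rectangle([0, 96, 255, 255], fill=(40, 60, 160, 255))

# The service would hand this single texture to every viewer, rather than
# each viewer fetching and compositing the individual layers itself.
bake_region([skin, shirt]).save("upper_body_baked.png")
```

The win is that viewers fetch one finished texture per body region instead of compositing layers themselves – which also hints at why trouble fetching assets upstream could plausibly ripple into the baking service.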

Taking April’s observation to be the case, it doubtless took the Lab a while to figure out how best to deal with the situation, which likely contributed to the time taken to rectify things to the point where people were no longer being so massively impacted. Hopefully, what occurred at the weekend will help the Lab better assess circumstances where such problems – unique as they may be – occur, and determine courses of action to mitigate them in the future.

In the meantime, April’s post, like Landon Linden’s update on the extended issues of May 2014, helps remind us of just what a hugely complex beast of systems and services Second Life is, and how even after 13 years of operations, it can still go wrong in ways that not only frustrate users, but also take the Lab by surprise, despite their best efforts. Kudos to April for presenting the explanation and for apologising for the situation. I hope she, and all involved, have had time to catch up on their sleep!

Related Links