Of outages and feedback

I normally keep a close eye on outgoing communications from the Lab, but this week I’ve had other things distracting me, and so haven’t been watching the official blog for posts and updates. My thanks therefore to reader BazdeSantis for pointing me to April Linden’s Tools and Technology update, The Story Behind Last Week’s Unexpected Downtime.

April has very much become the voice of the Lab’s Operations team, and has provided us with some excellent insights into Why Things Sometimes Went Wrong – a valuable exercise, as it increases both our understanding of the complexities inherent in Second Life and our awareness of what is likely to be going on behind the scenes when things do go drastically sideways.

April’s post refers to the issues experienced on Friday, May 6th, when the primary node of a central database failed, with April noting:

The database node that crashed holds some of the most core data to Second Life, and a whole lot of things stop working when it’s inaccessible, as a lot of Residents saw.

When the primary node in this database is off-line we turn off a bunch of services, so that we can bring the grid back up in a controlled manner by turning them back on one at a time.

There’s an interesting point to note here: this is the same – or a very similar – issue to the one which occurred in January 2016, which again goes to show that, given the constant usage it sees, Second Life is a volatile service – and that the Operations team are capable of turning major issues around in a remarkably short time: around 90 minutes in January, and less than an hour this time around.

Both events were also coupled with unexpected parallel issues: in January, the database problem was followed by issues with one of the Lab’s service providers, which did take a while to sort out. This time it was the Grid Status service. As I’ve recently reported, the Grid Status web pages have moved to a new provider, a change which has affected both the RSS feed and the integration of the Grid Status reporting pages with the rest of the Lab’s blog / forum Lithium service. However, as April also notes:

It can be really hard to tune a system for something like a status blog, because the traffic will go from its normal amount to many, many times that very suddenly. We see we now have some additional tuning we need to do with the status blog now that it’s in its new home.

She also points out that those on Twitter can track the situation with Second Life by following the Grid Status Twitter account.
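For those who would rather keep an eye on things from a script than from a Twitter client, the Grid Status RSS feed mentioned above can also be polled directly. The following Python sketch is purely illustrative – the feed URL is an assumption on my part, so do check the Grid Status page for the current address before relying on it:

    import urllib.request
    import xml.etree.ElementTree as ET

    # Assumed feed address - check the Grid Status page for the current URL.
    FEED_URL = "https://status.secondlifegrid.net/feed/"

    def latest_grid_status(url=FEED_URL, limit=3):
        """Return the titles of the most recent Grid Status posts from the RSS feed."""
        with urllib.request.urlopen(url, timeout=10) as response:
            tree = ET.parse(response)
        # Standard RSS 2.0 layout: <rss><channel><item><title>...</title></item>...
        items = tree.getroot().findall("./channel/item")
        return [item.findtext("title", default="(no title)") for item in items[:limit]]

    if __name__ == "__main__":
        for title in latest_grid_status():
            print(title)

Run from the command line, this prints the titles of the most recent status posts – a quick sanity check when things appear to be going sideways, without needing a Twitter account.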

April’s posts are always welcome and well worth reading, and this one is no exception. We obviously don’t like it when things go wrong, but it’s impossible for SL to be all plain sailing. So, as I’ve said before (and above), hearing just what goes on behind the scenes to fix things when they do go wrong helps remind and inform us just how hard the Lab actually does work to keep the complexities of a 13-year-old platform running for us to enjoy.

2 thoughts on “Of outages and feedback”

  1. Shug Maitland

    I have sung this song before:
    – Host Grid Status completely separately from anything else Second Life; putting it in the official SL blogs is just looking for trouble!
    – The Google calendar on the grid status page should include all scheduled maintenance, not just when they might do restarts.
    – Establish an SL concurrency database like we used to have at http://etitsup.com/. This gave a quick sanity check when we were having problems (is it just me or the whole grid?) and required no live interaction from the Lindens.
    – The Twitter account now seems to be available to those of us who do not “social network”, but it is another thing for the Lindens to post to when they are busy trying to fix a problem.

    1. Inara Pey Post author

      “Host Grid Status completely separately from anything else second life”

      It is. 🙂

      The Grid Status feed, etc., is provided through a third party. As April notes, the issues which occurred with it at the time of the 6th May problems were a result of the service needing further fine-tuning to be able to meet the demand placed upon it.

