April Linden blogs on the May 13th/14th downtime

The week of May 13th-17th saw a planned period of Second Life network maintenance work, as announced in the Grid Status updates.

The first tranche of this work – Monday, May 13th through Tuesday, May 14th – appeared to go well, until a completely unexpected 4(ish) hours of downtime occurred, causing significant upset at the time.

On May 17th, April Linden, the Second Life Operations Manager, provided an insightful blog post on both the work being carried out and the cause of the downtime.

This week we were doing much needed maintenance on the network that powers Second Life. The core routers that connect our data centre to the Internet were nearing their end-of-life, and needed to be upgraded to make our cloud migration more robust.

Replacing the core routers on a production system that’s in very active use is really tricky to get right. We were determined to do it correctly, so we spent over a month planning all of the things we were going to do, and in what order, including full roll-back plans at each step. We even hired a very experienced network consultant to work with us to make sure we had a really good plan in place, all with the goal of interrupting Second Life as little as we could while improving it …

Everything started out great. We got the first new core router in place and taking traffic without any impact at all to the grid. When we started working on the second core router, however, it all went wrong.

– Extract from April Linden’s blog post

In essence, a cable had to be relocated, which was expected to cause a very brief period of impact. However, things didn’t recover as anticipated, and April resumes her explanation:

After the shock had worn off we quickly decided to roll back the step that failed, but it was too late. Everyone that was logged into Second Life at the time had been logged out all at once. Concurrency across the grid fell almost instantly to zero. We decided to disable logins grid-wide and restore network connectivity to Second Life as quickly as we could.

At this point we had a quick meeting with the various stakeholders, and agreed that since we were down already, the right thing to do was to press on and figure out what happened so that we could avoid it happening again…

This is why logins were disabled for several hours. We were determined to figure out what had happened and fix the issue, because we very much did not want it to happen again. We’ve engineered our network in a way that any piece can fail without any loss of connectivity, so we needed to dig into this failure to understand exactly what happened.

– Extract from April Linden’s blog post

April Linden

In other words, while it may have been painful for those who were unceremoniously dumped from Second Life and found they could not get back in, the Lab were working with the best of intentions: trying to find out exactly why connectivity was lost within a network where such an event should not cause such a drastic breakage. It's worth noting that, as per April's blog post, even the engineers from the manufacturer of the Lab's network equipment were perplexed by what happened.
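For those curious about what "any piece can fail without any loss of connectivity" means in practice, here is a minimal sketch of the idea, assuming an entirely invented topology. The device names below are illustrative only and do not reflect Linden Lab's actual network: the network is modelled as a graph, and the check confirms that removing any single device still leaves a path from the simulators to the internet.

```python
# Toy illustration of redundancy checking: model the network as a graph
# and verify that no single device failure disconnects the grid from
# the internet. Topology and names are invented for illustration; this
# is not Linden Lab's actual network.

from collections import deque

# Hypothetical topology: simulators reach the internet via two core routers.
LINKS = {
    "simulators":    ["switch-a", "switch-b"],
    "switch-a":      ["simulators", "core-router-1", "core-router-2"],
    "switch-b":      ["simulators", "core-router-1", "core-router-2"],
    "core-router-1": ["switch-a", "switch-b", "internet"],
    "core-router-2": ["switch-a", "switch-b", "internet"],
    "internet":      ["core-router-1", "core-router-2"],
}

def reachable(links, src, dst, failed=frozenset()):
    """Breadth-first search: is dst reachable from src once the
    failed devices are removed from the graph?"""
    if src in failed or dst in failed:
        return False
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for neighbour in links.get(node, []):
            if neighbour not in failed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False

# Verify the "any piece can fail" property for every single-device failure.
for device in LINKS:
    if device in ("simulators", "internet"):
        continue
    ok = reachable(LINKS, "simulators", "internet", failed=frozenset({device}))
    print(f"fail {device:14s} -> connectivity {'kept' if ok else 'LOST'}")
```

Real-world failure testing of the kind April mentions presumably goes well beyond this toy check – pulling actual links and power, not just reasoning about a diagram – but the underlying property being verified is the same.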

As always, April’s blog post makes for an invaluable read in understanding some of the complexities of Second Life, and goes so far as to answer a question raised on the forums in the wake of the week’s problems: why didn’t LL tell us exactly when this maintenance was going to happen? In short, because there are bad actors in the world who could make use of publicly available announcements giving them precise information on when a network might be exposed.

If you’ve not read April’s blog posts on operational issues like this, I really cannot recommend them enough – and thanks are again offered to April for providing this post. And while things might have hurt at the time, there is a silver lining, as she notes:

Second Life is now up and running with new core routers that are much more powerful than anything we’ve had before, and we’ve had a chance to do a lot of failure testing. It’s been a rough week, but the grid is in better shape as a result.


2 thoughts on “April Linden blogs on the May 13th/14th downtime”

  1. April’s statement that “Concurrency across the grid fell almost instantly to zero” is not *quite* true. There was a small handful of residents who remained on until the last half hour. I first noticed this on Editsup and was able to confirm it with a friend who knew one of those people. The person who stayed on is, unfortunately, a newbie and remembers only “somewhere fishing while afk”. I think 1 or 2 simulators were still connected, so roughly 30 regions.

