April Linden explains August 22nd’s Second Life woes

Tuesday, August 22nd was not a particularly good day for Second Life, with an extended period of unscheduled maintenance during which log-ins were suspended and those in-world were advised to refrain from rezzing No Copy objects, making any LindeX related transactions, etc.

If these words sound familiar (except the date), it’s because I wrote them a year ago to the day, on August 23rd, 2016, when Second Life experienced some significant issues.

Back then, the problem was the core database. The initial problems on August 22nd, 2017 weren’t software related, nor were they related to the Main (SLS) channel deployment taking place at the time. Instead, they lay with a piece of hardware, as April Linden explained in another concise post on the Tools and Technology blog, which started:

Early this morning (during the grid roll, but it was just a coincidence) we had a piece of hardware die on our internal network. When this piece of hardware died, it made it very difficult for the servers on the grid to figure out how to convert a human-readable domain name, like www.secondlife.com, into IP addresses, like 216.82.8.56.

Everything was still up and running, but none of the computers could actually find each other on our network, so activity on the grid ground to a halt. The Second Life grid is a huge collection of computers, and if they can’t find each other, things like switching regions, teleports, accessing your inventory, changing outfits, and even chatting fail. This caused a lot of Residents to try to relog.

We quickly rushed to get the hardware that died replaced, but hardware takes time – and in this case, it was a couple of hours. It was very eerie watching our grid monitors. At one point the “Logins Per Minute” metric was reading “1,” and the “Percentage of Successful Teleports” was reading “2%.” I hope to never see numbers like this again.

Unfortunately, as April went on to explain, the problems didn’t end there, as the log-in service got into something of a mismatch once the hardware issue had been resolved. Whilst telling viewers attempting to log in to the grid that their attempts were unsuccessful, the service was telling the simulators the log-ins had been successful. Things didn’t start returning to normal until this issue had been corrected.

There is some good news coming out of this latter situation, however, as April goes on to note in the blog post:

We are currently in the middle of testing our next generation login servers, which have been specifically designed to better withstand this type of failure. We’ve had a few of the next generation login servers in the pool for the last few days just to see how they handle actual Resident traffic, and they held up really well! In fact, we think the only reason Residents were able to log in at all during this outage was because they happened to get really lucky and got randomly assigned to one of the next generation login servers that we’re testing.

Testing of the new log-in servers has yet to be completed, but April notes that the hope is they will be ready for deployment soon.

Thanks once again to April for the update on the situation.


6 thoughts on “April Linden explains August 22nd’s Second Life woes”

    1. Indeed! I was going to pass a comment in the article about cueing the Twilight Zone music at the coincidence in dates, as well :).

      Sadly, going on some of the comments in the forums, the one thing that hasn’t changed is people’s willingness to immediately jump to conclusions and make assumptions. Hence why April’s blogs are appreciated.


  1. Bring back some of the people currently working on Sansar and let them resume their old jobs of keeping Linden’s cash-cow, Second Life, running the way it should. Equipment doesn’t just up and die; it sends out warnings before it goes BSOD, but if the staff managing the system is either inexperienced or understaffed, this sort of thing happens.
    Having spent quite a few years with computing and systems maintenance, I’ve been waiting for something like this to happen. So Sansar takes six months longer, so what? Without the cash flow from Second Life, Sansar will remain a dream and no more.
    I left my last job when management started deferring maintenance and repairing rather than maintaining as a means of keeping their system going. When they went to salvaging parts from dead servers and desk-tops, I started looking for work elsewhere. I didn’t want to be there when the system went down for good.
    “Be ye kind to thy donkey, for it bears you.”


    1. “Bring back some of the people currently working on Sansar and let them resume their old jobs of keeping Linden’s cash-cow, Second Life, running the way it should.”

      This pre-supposes that those working on Sansar a) are former SL experts; and b) are now exclusively working on Sansar. While it is true that some of the experts we’ve been familiar with on Second Life (Monty, Runitai, Nix to name three) are working far more on Sansar than SL, it’s equally true that those with both SL and Sansar expertise are fluid, moving between the two as required. This is likely to be especially true where Landon’s Ops team is concerned.

      As I’ve reported in my weekly updates, the Lab actually is investing in hardware and infrastructure for Second Life, and has recently indicated that the rest of this year will be a continuation of this focus.

      Could the hardware failure yesterday have been avoided? Possibly; I’m not an expert on these things. But to jump from that to the idea that it’s a result of a lack of care and attention for SL is stretching things. As you say, Second Life is the Lab’s major revenue earner; ergo, it makes sense for them to ensure it stays that way through investment, maintenance and update. Anything else, for at least the next few years, if not longer, simply would not be logical.


  2. Linden Labs, I for one appreciate the wonderful service and efforts on behalf of a sometimes seemingly unappreciative clientele. Perhaps the nay-sayers speak louder than those who appreciate and understand. Thank you. All of you!

