Why things went wrong recently with Second Life, by Landon Linden

We’re all aware of the recent unpleasantness which hit Second Life over the past few weeks, culminating in the chaos of Tuesday, May 20th, when the disruption not only caused issues with log-ins, but also led to a curtailment of server-side deployments on the Tuesday, a rescheduling of deployments for the rest of the week, and the postponement of a period of planned maintenance.

As noted in my week 20/2 SL projects update, Simon and Maestro Linden gave an explanation of Tuesday’s issues at the Server Beta meeting on Thursday, May 22nd. However, in a Tools and Technology blog post, Landon Linden has given a comprehensive explanation of the broader issues that have hit Second Life in recent weeks.

Landon begins the post:

When I came to Linden Lab over five years ago, Second Life had gone through a period of the coveted hockey-stick growth, and we had just not kept up with the technical demands such growth creates. One or more major outages a week were common.

In my first few months at the Lab, we removed more than a hundred major single points of failure in our service, but several major ones still loomed large, the granddaddy of them all being the core MySQL database server. By late Winter 2009 we were suffering from a core database outage a few times each week.

It is that core MySQL database server that has been partially to blame for the recent problems, having hit two different fatal hardware faults, each of which forced the Lab to stop most SL services. As the blog post explains, work is in hand to reduce the risk of this database being a single point of failure by moving it to new hardware, and this will be followed by further work over the coming weeks and months to reduce the impact of database failures.
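
To illustrate the general idea, the sketch below is a purely hypothetical Python example of what application-level failover away from a single database host can look like. The pymysql library, the host names, the credentials and the table in the commented usage line are all assumptions made for the illustration; nothing here describes the Lab’s actual infrastructure.

```python
# Hypothetical sketch: fall back to a replica when the primary MySQL host
# is unreachable. Hosts, credentials, and the pymysql dependency are
# illustrative assumptions, not a description of the Lab's setup.
import pymysql

DB_HOSTS = ["db-primary.example.com", "db-replica.example.com"]


def run_read_query(sql, params=None):
    """Try each host in turn and return rows from the first one that answers."""
    last_error = None
    for host in DB_HOSTS:
        try:
            conn = pymysql.connect(host=host, user="sl_reader",
                                   password="secret", database="core")
            try:
                with conn.cursor() as cur:
                    cur.execute(sql, params)
                    return cur.fetchall()
            finally:
                conn.close()
        except pymysql.err.OperationalError as err:
            # Connection-level failure (e.g. a hardware fault): try the next host.
            last_error = err
    raise RuntimeError("all database hosts unavailable") from last_error


# Example usage (requires reachable servers and a matching schema):
# rows = run_read_query("SELECT name FROM regions WHERE online = %s", (1,))
```

A real deployment would of course handle this in the database layer itself (replication, proxies and the like) rather than in ad-hoc client code; the point is simply that a second host turns one fatal fault into a degraded-but-available service.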

But the MySQL issue wasn’t the only cause of problems, as Landon further explains:

A few weeks ago there was a massive distributed denial of service attack on one of our upstream service providers that affected most of their customers, including us, and inhibited the ability of some to use our services. We have since mitigated future potential impact from such an attack by adding an additional provider. There have also been hardware failures in the Marketplace search infrastructure that have impacted that site, a problem that we are continuing to work through.

Landon Linden: explaining why SL has suffered severe issues of late

He also provides further information on the issue which impacted users and services on Tuesday, May 20th, expanding on that given by Simon and Maestro at the Server Beta meeting.

At that meeting, Simon briefly outlined Tuesday’s issues as being a case of the log-in server failing to give the viewer the correct token for it to connect to a region, so people actually got through the log-in phase when starting their viewer, but never connected to a region.

Landon expands on this, describing how the mechanism for handing off sessions from log-in to users’ initial regions is a decade old and relies on the generation of a unique identifier (the “token” Simon referred to). Simply put, the mechanism ran out of numbers, but did so quietly and without flagging the fact that it had. As a result, it took the server team four hours to track down the problem and come up with a fix.
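
For anyone wondering how an identifier pool can run dry without anyone noticing, the toy Python sketch below shows the difference between a generator that exhausts silently and one that fails loudly. It is purely illustrative: the class names, pool size and exception are invented for the example and bear no relation to the Lab’s actual service or token format.

```python
# Toy illustration only: a fixed-size identifier pool that raises a clear
# error when it runs out, instead of quietly handing back nothing.
class IdentifierExhausted(Exception):
    """Raised when the token space has been used up."""


class TokenAllocator:
    def __init__(self, max_ids=2**32):
        self.max_ids = max_ids   # size of the identifier space
        self.next_id = 0         # next identifier to hand out

    def allocate(self):
        if self.next_id >= self.max_ids:
            # A silent variant might return None or wrap around here,
            # leaving log-ins to fail with no obvious cause. Raising
            # makes the exhaustion visible the moment it happens.
            raise IdentifierExhausted(
                f"token space of {self.max_ids} identifiers exhausted")
        token = self.next_id
        self.next_id += 1
        return token


# A tiny pool exhausts after three hand-offs:
allocator = TokenAllocator(max_ids=3)
for attempt in range(4):
    try:
        print("session token:", allocator.allocate())
    except IdentifierExhausted as err:
        print("hand-off failed loudly:", err)
```

In the silent version of such a failure, every log-in after exhaustion would appear to succeed at the authentication step and then simply never reach a region, which matches the behaviour users saw and goes some way to explaining why the fault took hours to pin down.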

Referring to this particular issue, Landon goes on:

Having such a hidden fault in a core service is unacceptable, so we are doing a thorough review of the login process to determine if there are any more problems like this lurking. Our intent at this point also is to remove the identifier assignment service altogether. It not only was the ultimate source of this outage, but is also one more single point of failure that should have been dispatched long ago.

Such open honesty and transparency about technical matters is something that hasn’t really been seen from the Lab since Frank (FJ Linden) Ambrose, the Lab’s former Senior VP of Global Technology, departed the company at the end of 2011. As such, it is an excellent demonstration of Ebbe Altberg’s promise to re-open the lines of communication between company and users, and one which is most welcome.

Kudos to Landon for his sincere apology for the disruption in services and for such a comprehensive explanation of the problems. Having such information will hopefully aid our understanding of the challenges the Lab faces in dealing with a complex set of services which is over a decade old, but which we expect to be ready and waiting for us 24/7. Kudos as well to Ebbe Altberg for re-opening the hailing frequencies. Long may it continue.

8 thoughts on “Why things went wrong recently with Second Life, by Landon Linden”

  1. I am most impressed and so pleased to see such progress in LL becoming open and involving the community again. Too long it has been since we have seen this and it really renews hope and excitement for the future of SL, at least for me. Kudos and especially thank-yous to Ebbe, Landon and all those behind and in support of this forward push towards what should have been all along.

    “Respect is earned. Honesty is appreciated. Trust is gained. Loyalty is returned.”
    Author: Unknown.

  2. When they take the time to explain such things, we have more confidence and better understanding of what the Lab goes through to provide the service. Excellent work sorting it out, working toward improvement, and making the effort to make sure we understand 😀

  3. Erm yes. Complex..yes. DDoS on upstream provider – erm who was that? As Landon says ‘their customers too’ so… Name please so I don’t spec them? And one database critical hardware fallover fine – two is heads roll time, sorry. I took that arrow once and not to the knee.
    Talking is nice – ‘mitigating future potential impacts’ is a retro progressive rock album title, sorry

  4. I only realised as I was reading this post that I’d not heard any kind of explanation regarding the recent outage. What was most surprising to me, however, was that I have learned to consider that “normal”. I’ve become so accustomed to a lack of any kind of explanation (let alone apology or acknowledgement), that I no longer expect it. So yes, this is a great example of the positive change we’re seeing. Thank you for sharing it and making the link.

  5. You guys need more redundancy or clustered services. Perhaps think of some cloud technology.

  6. sirhc; 2 stumbles along the path they are taking to update both the hardware and software on the SL back end is extremely good! But then I remember the days when the whole grid shut down to let the hippos stomp gremlins and the apes bang on things.

    As to LL communication, I am very impressed. We are still a long way from the goal of SL being *us* and the distinction between LL and resident/customer being truly recognized as 2 parts of one well-oiled machine. It seems we are moving in the right direction at last!
