Why things went wrong recently with Second Life, by Landon Linden

We’re all aware of the recent unpleasantness which has hit Second Life over the past few weeks, culminating in the chaos of Tuesday, May 20th. That day’s disruption not only caused issues with log-ins, but also curtailed the server-side deployments on Tuesday, forced a rescheduling of the deployments for the rest of the week, and led to the postponement of a period of planned maintenance.

As noted in my week 20/2 SL projects update, Simon and Maestro Linden gave an explanation of Tuesday’s issues at the Server Beta meeting on Thursday May 22nd. However, in a Tools and Technology blog post, Landon Linden has given a comprehensive explanation of the broader issues that have hit Second Life in recent weeks.

Landon begins the post:

When I came to Linden Lab over five years ago, Second Life had gone through a period of the coveted hockey-stick growth, and we had just not kept up with the technical demands such growth creates. One or more major outages a week were common.

In my first few months at the Lab, we removed more than a hundred major single points of failure in our service, but several major ones still loomed large, the granddaddy of them all being the core MySQL database server. By late Winter 2009 we were suffering from a core database outage a few times each week.

It is that core MySQL database server that has been partially to blame for the recent problems, having hit two different fatal hardware faults, each of which forced the Lab to stop most SL services. As the blog post explains, work is in hand to reduce the risk of this database acting as a single point of failure by moving it to new hardware. This will be followed by further work over the coming weeks and months to reduce the impact of database failures.

But the MySQL issue wasn’t the only cause of problems, as Landon further explains:

A few weeks ago there was a massive distributed denial of service attack on one of our upstream service providers that affected most of their customers, including us, and inhibited the ability of some to use our services. We have since mitigated future potential impact from such an attack by adding an additional provider. There have also been hardware failures in the Marketplace search infrastructure that have impacted that site, a problem that we are continuing to work through.

Landon Linden: explaining why SL has suffered severe issues of late

He also provides further information on the issue which impacted users and services on Tuesday May 20th, expanding on that given by Simon and Maestro at the Server Beta meeting.

At that meeting, Simon briefly outlined Tuesday’s issues as being a case of the log-in server failing to give the viewer the correct token for it to connect to a region, so people actually got through the log-in phase when starting their viewer, but never connected to a region.

Landon expands on this, describing how the mechanism for handing-off of sessions from login to users’ initial regions is a decade old and relies on the generation of a unique identifier (the “token” Simon referred to). Simply put: the mechanism ran out of numbers – but did so quietly and without flagging the fact that it had. As a result, the server team took four hours to track down the problem and come up with a fix.
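To illustrate why a quiet failure of this kind is so hard to track down, here is a minimal Python sketch. It is purely hypothetical: the Lab's post does not describe how the identifier service is implemented, so the class names, the fixed numeric range and the use of `None` as the "no token" result are all assumptions made for the example. It simply contrasts an allocator that runs out of numbers silently with one that reports the fact.

```python
class IdExhausted(Exception):
    """Raised when the (hypothetical) identifier range has been used up."""


class SilentAllocator:
    """Hands out IDs from 1..max_id and returns None once they run out."""

    def __init__(self, max_id):
        self.max_id = max_id
        self.next_id = 1

    def allocate(self):
        if self.next_id > self.max_id:
            # Quiet failure: the caller gets no token and nothing is flagged.
            return None
        token = self.next_id
        self.next_id += 1
        return token


class LoudAllocator(SilentAllocator):
    """Same range, but running out of numbers is raised as an error."""

    def allocate(self):
        token = super().allocate()
        if token is None:
            raise IdExhausted(f"identifier range exhausted at {self.max_id}")
        return token
```

With the silent version, a log-in can appear to succeed while the hand-off to a region never happens, because no usable token was issued. With the loud version, the first failed allocation names the cause immediately.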

Referring to this particular issue, Landon goes on:

Having such a hidden fault in a core service is unacceptable, so we are doing a thorough review of the login process to determine if there are any more problems like this lurking. Our intent at this point also is to remove the identifier assignment service altogether. It not only was the ultimate source of this outage, but is also one more single point of failure that should have been dispatched long ago.

Such open honesty and transparency about technical matters is something that hasn’t really been seen from the Lab since the departure of Frank (FJ Linden) Ambrose, the Lab’s former Senior VP of Global Technology, who departed the company at the end of 2011. As such, it is an excellent demonstration of Ebbe Altberg’s promise to re-open the lines of communication between company and users, and one which is most welcome.

Kudos to Landon for his sincere apology for the disruption in services and for such a comprehensive explanation of the problems. Having such information will hopefully aid our understanding of the challenges the Lab faces in dealing with a complex set of services which is over a decade old, but which we expect to be ready and waiting for us 24/7. Kudos again, as well, to Ebbe Altberg for re-opening the hailing frequencies. Long may it continue.

Where there’s No Signal in Second Life

No Signal

Out on the headland, there stands a great tower, so close to the water’s edge that when the tide is in, the only way to reach it is by a wooden walk winding through the coastal reeds.

They say that once it was a nexus, a hub for all of our electronic comings and goings. Through the great dishes and between the long narrow repeaters, all our business rushed back and forth at the speed of light. Directed, amplified, boosted, beamed, messages too numerous to ever truly comprehend passed through that great tower.

But that was then, and this is now. The messages no longer come and go; the invisible beams of information no longer form an unseen web of lines spreading outward from its slender form, up down, left right, some passing one another so close, if they could ever have been seen, you’d swear they were touching. There is No Signal any more.

Now the tower stands alone on the headland, its great dishes broken, the clusters of microwave emitters hanging forlornly by the heavy cables that once fed them power. Rust now coats the tower’s metal, and its platforms sit in disrepair, lopsided against the backdrop of the sea.

A ladder, as rusted as the rest of the tower, still runs up the side of the structure for those who dare to climb, the creak and groan of metal on metal an ever-present reminder of the decay that sits here.

They say there is a mystery here, waiting to be solved, that if you follow the clues, the enigma of the tower will be revealed. Perhaps the key to the riddle lies within the strange figure, one hand gripping the topmost spire of the tower tightly, their body outstretched, free hand reaching to catch … their hat? … as it is caught upon the wind.

Or perhaps the secret lies elsewhere in the tower’s slender finger. The only way to find out is to walk the headland yourself and visit the place where No Signal can now be found …

No Signal is the latest piece by Nessuno Myoo, currently on display at MIC Imagin@rium, curated by Mexi Lane, and open until June 14th, 2014. Be sure to grab a note card from the welcome board after you arrive. And while visiting, why not take the time to explore the new prim and mesh amphitheatre on the main island, the work of Rumegusc Altamura?
