We’re all aware of the recent unpleasantness that hit Second Life over the past few weeks, culminating in the chaos of Tuesday, May 20th, when the disruption not only caused log-in issues, but also led to a curtailment of that day’s server-side deployments, a rescheduling of deployments for the rest of the week, and the postponement of a period of planned maintenance.
As noted in my week 20/2 SL projects update, Simon and Maestro Linden gave an explanation of Tuesday’s issues at the Server Beta meeting on Thursday, May 22nd. However, in a Tools and Technology blog post, Landon Linden has now given a comprehensive explanation of the broader issues that have hit Second Life in recent weeks.
Landon begins the post:
When I came to Linden Lab over five years ago, Second Life had gone through a period of the coveted hockey-stick growth, and we had just not kept up with the technical demands such growth creates. One or more major outages a week were common.
In my first few months at the Lab, we removed more than a hundred major single points of failure in our service, but several major ones still loomed large, the granddaddy of them all being the core MySQL database server. By late Winter 2009 we were suffering from a core database outage a few times each week.
It is that core MySQL database server that has been partially to blame for the recent problems, having hit two different fatal hardware faults which forced the Lab to stop most SL services on both occasions. As the blog post explains, work is in hand to reduce the risk posed by this database as a single point of failure by moving it to new hardware, with further work over the coming weeks and months aimed at reducing the impact of database failures still further.
But the MySQL issue wasn’t the only cause of problems, as Landon further explains:
A few weeks ago there was a massive distributed denial of service attack on one of our upstream service providers that affected most of their customers, including us, and inhibited the ability of some to use our services. We have since mitigated future potential impact from such an attack by adding an additional provider. There have also been hardware failures in the Marketplace search infrastructure that have impacted that site, a problem that we are continuing to work through.
He also provides further information on the issue which impacted users and services on Tuesday May 20th, expanding on that given by Simon and Maestro at the Server Beta meeting.
At that meeting, Simon briefly outlined Tuesday’s issues as a case of the log-in server failing to give the viewer the correct token for connecting to a region; people got through the log-in phase when starting their viewers, but never actually connected to a region.
Landon expands on this, describing how the mechanism for handing off sessions from login to users’ initial regions is a decade old and relies on the generation of a unique identifier (the “token” Simon referred to). Simply put: the mechanism ran out of numbers, but did so quietly, without flagging the fact that it had. As a result, it took the server team four hours to track down the problem and come up with a fix.
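To illustrate the failure mode described (this is purely a hypothetical sketch in Python, not Linden Lab’s actual code or identifier scheme), consider an allocator that hands out session identifiers from a fixed pool. The “quiet” version returns nothing once the pool is exhausted, leaving callers stuck, while the “loud” version surfaces the exhaustion immediately:

```python
class QuietTokenAllocator:
    """Hands out sequential session identifiers from a fixed space.

    Hypothetical example: once the space is exhausted it returns
    None silently, so the caller receives no usable token and the
    fault goes unflagged -- the kind of hidden failure described
    in the blog post.
    """

    def __init__(self, max_ids):
        self.next_id = 0
        self.max_ids = max_ids

    def allocate(self):
        if self.next_id >= self.max_ids:
            return None  # silent failure: no token, no error raised
        token = self.next_id
        self.next_id += 1
        return token


class LoudTokenAllocator(QuietTokenAllocator):
    """Same allocator, but exhaustion raises an explicit error."""

    def allocate(self):
        token = super().allocate()
        if token is None:
            raise RuntimeError("session identifier space exhausted")
        return token


# With a tiny identifier space, the fourth request quietly yields None:
quiet = QuietTokenAllocator(max_ids=3)
tokens = [quiet.allocate() for _ in range(4)]
print(tokens)  # [0, 1, 2, None]
```

The “loud” variant is the shape of fix the post implies: a core service should fail noisily and immediately, rather than handing out invalid results for hours before anyone notices.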
Referring to this particular issue, Landon goes on:
Having such a hidden fault in a core service is unacceptable, so we are doing a thorough review of the login process to determine if there are any more problems like this lurking. Our intent at this point also is to remove the identifier assignment service altogether. It not only was the ultimate source of this outage, but is also one more single point of failure that should have been dispatched long ago.
Such open honesty and transparency about technical matters is something that hasn’t really been seen from the Lab since the departure of Frank (FJ Linden) Ambrose, the Lab’s former Senior VP of Global Technology, at the end of 2011. As such, it is an excellent demonstration of Ebbe Altberg’s promise to re-open the lines of communication between company and users, and one which is most welcome.
Kudos to Landon for his sincere apology for the disruption in services and for such a comprehensive explanation of the problems. Having such information will hopefully aid our understanding of the challenges the Lab faces in dealing with a complex set of services which is over a decade old, but which we expect to be ready and waiting for us 24/7. Kudos, again, to Ebbe Altberg for re-opening the hailing frequencies. Long may it continue.