Lab updates on unscheduled deployments and other issues

As noted in my recent SL project update, there was an unscheduled deployment to the three Agni (main) grid release candidate (RC) channels of Bluesteel, LeTigre and Magnum on Thursday, February 18th, which saw regions on these channels undergo a rolling restart. This was followed on Friday, February 19th by rolling restarts across the Main (SLS) channel.

During the Server Beta User Group (SBUG) meeting on Thursday, February 18th, Steven Linden provided some information on why a deployment was made to the RC channels, indicated that a similar deployment would be forthcoming on the Main (SLS) channel, and promised that further information would be provided once that deployment had been made:

We had an unscheduled RC deploy earlier today. It’s for a security vulnerability that was released, and we discovered that Second Life regions were vulnerable. A full public post-mortem will be coming after we deploy to the rest of the main grid. I can’t say until it goes out to the rest of Agni; I can say that it was related to region availability only…. I honestly can’t say a great deal, other than we have a fix, and that it’s coming very soon to the rest of Agni.

True to this promise, following the Main channel roll on Friday, February 19th, April Linden blogged Why the Friday Grid Roll?

The reason essentially boiled down to a vulnerability in the GNU C library, commonly referred to as glibc, which is used by the Linux systems running the grid servers. If exploited, the vulnerability could allow remote access to a device – be it a computer, internet router, or other connected piece of equipment. It was initially discovered by Google on Tuesday, February 16th, and was labelled CVE-2015-7547.
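
For context, CVE-2015-7547 was a stack buffer overflow in glibc's getaddrinfo() DNS resolver, so essentially any server-side code performing host name lookups was potentially exposed. As a minimal illustration (not the Lab's code), even an ordinary name resolution call like the following is serviced by that library function on a glibc-based Linux host:

```python
import socket

# On Linux, socket.getaddrinfo() is backed by glibc's getaddrinfo(), the
# function affected by CVE-2015-7547: a crafted DNS response could overflow
# a stack buffer during the lookup. "localhost" is used here purely so the
# snippet runs without needing external DNS.
for family, socktype, proto, canonname, sockaddr in socket.getaddrinfo(
        "localhost", 443, proto=socket.IPPROTO_TCP):
    print(family.name, sockaddr)
```

The point is simply that the vulnerable code path sits beneath almost everything a networked service does, which is why the Lab had to assess exposure across the whole of its infrastructure rather than any single component.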

April’s blog post provides a concise explanation of the Lab’s security and operations teams’ efforts to ascertain SL’s exposure to the vulnerability and to develop an update securing their servers against it.
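
April's post doesn't include commands, but as a rough sketch of the kind of first check involved, one can ask the running C library which glibc version a host is using (this is a glibc-specific call, and the exact patched package version for CVE-2015-7547 varied by distribution, so a version number alone doesn't settle whether a given host was fixed):

```python
import ctypes

# Query the running C library for its version via gnu_get_libc_version();
# this is glibc-specific, so the snippet assumes a glibc-based Linux host.
# Distributions shipped backported fixes for CVE-2015-7547 under their own
# package versions, so this number is only a starting point for assessment.
libc = ctypes.CDLL("libc.so.6")
libc.gnu_get_libc_version.restype = ctypes.c_char_p
print("glibc", libc.gnu_get_libc_version().decode())
```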

All of this took time – but all things considered, it was still a remarkably fast effort. The Lab went from hearing about the risk on Tuesday, February 16th, to understanding the full extent of SL’s possible exposure, to having an update coded, tested and ready for release by Thursday – which, as April explained, then left them with another decision:

Do we want to roll the code to the full grid at once? We decided that since the updates were to one of the most core libraries, we should be extra careful, and decided to roll the updates to the Release Candidate (RC) channels first. That happened on Thursday morning.

The Lab wanted to monitor how things progressed on the RC channels (which between them represent roughly 30% of the total grid) and ensure the update itself didn’t introduce anything unexpected. So it was that the deployment to the rest of the grid couldn’t be made until Friday, February 19th.
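
As a purely illustrative sketch (none of these names or structures come from the Lab), a phased roll of this kind amounts to splitting regions into an RC cohort that receives the update first and a Main cohort that is held back until monitoring gives the all-clear:

```python
from dataclasses import dataclass

# Hypothetical model of a two-phase deploy: regions on the three RC
# channels (~30% of the grid) receive the update first; the Main (SLS)
# channel roll is gated on how the RC cohort behaves.
@dataclass
class Region:
    name: str
    channel: str  # "Bluesteel", "LeTigre", "Magnum" or "SLS"

RC_CHANNELS = {"Bluesteel", "LeTigre", "Magnum"}

def rollout_phases(regions):
    """Split regions into (rc_first, main_later) deployment phases."""
    rc_first = [r for r in regions if r.channel in RC_CHANNELS]
    main_later = [r for r in regions if r.channel not in RC_CHANNELS]
    return rc_first, main_later

regions = [Region("Alpha", "Magnum"), Region("Beta", "SLS"),
           Region("Gamma", "LeTigre")]
rc, main = rollout_phases(regions)
print([r.name for r in rc])    # ['Alpha', 'Gamma']
print([r.name for r in main])  # ['Beta']
```

The trade-off is exactly the one April describes: rolling everywhere at once is faster, but staging through the RC channels limits the blast radius if an update to something as fundamental as the C library misbehaves.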

April emphasises that at no point before or during the known period of exposure was there any attempt to use the vulnerability against the SL servers. At the time of the Thursday roll, there was some criticism directed at the Lab for the lack of warning; April also explains why this was the case:

The reason there was little notice for the roll on Thursday is two-fold. First, we were moving very quickly, and second because the roll was to mitigate a security issue, we didn’t want to tip our hand and show what was going on until after the issue had been fully resolved.

While things like unscheduled rolls are disruptive, leaving us prone to grumbling and finger-pointing, it’s perhaps worthwhile taking this incident as an example that sometimes there are good reasons why the Lab doesn’t announce things in advance.

April’s post is actually one of three published recently by the operations / engineering teams which provide interesting insight into what goes on behind the scenes in keeping Second Life running.

In Recent Issues with the Nightly Biller, Steven Linden provides an explanation of why some Premium members recently experienced billing issues, up to and including inadvertently receiving delinquent balance notices. Once again, the explanation of what happened and what has been done to try to ensure a similar problem doesn’t occur in the future makes for a worthwhile read.

In Tale of the Missing ACK, Chris Linden describes another unusual and challenging incident the Lab’s engineering team had to deal with when testing a new API endpoint hosted in Amazon. This again illustrates the overall complexity of the Second Life services and infrastructure, which extends far beyond the simulator servers we so often take for granted as being “the” SL service, and the difficulties involved in tracking issues down when things don’t go as expected / planned.

Thanks again to April, Steven and Chris for providing the explanations and the insight into SL’s services.