The “LeTigre event” and seeking to safeguard deployments

As reported last week, Wednesday October 3rd saw a massive problem hit the LeTigre Release Candidate (RC) channel, impacting over 1200 regions. The most visible effect was a large number of items (including partial builds) being returned to people’s inventories, because an error in the prim accounting code caused the software to see regions as “full”. The disruption continued across the grid throughout Wednesday and into Thursday, partly because the affected regions not only required a corrective deployment of RC code (from BlueSteel), but also had to be manually restored to their state prior to the LeTigre deployment.

Since the problem occurred, Linden Lab has been looking not only into the bug within the prim accounting software, but also at its internal processes: why the error wasn’t picked up before the LeTigre deployment went ahead, what steps can be taken to curtail such a massive disruption should a similar problem ever occur, and how regions can be restored in a less manually intensive manner. Even so, devising a solution which fits every possible scenario in which a deployment may go wrong isn’t easy.

Speaking at the Sim / Server User Group meeting, Simon Linden commented on the matter: “The tough thing with SL testing is running it on all those combinations – it’s just never possible to have complete test coverage. The RC channels are actually designed to be representative of the whole grid, so we try to keep a mix of the different types of regions like full, mainland, Linden Homes, etc. On one hand, the system worked … we found a problem before it got to the whole grid, which might have been how this would have happened before we started the Release Channels. But it really was so bad we definitely want to catch something like that even earlier. So … we’ve had a bunch of meetings discussing what and why it happened, and have some better tests added to the regular test pass so this specific problem won’t happen again.”

Options apparently discussed at these meetings to prevent a similar issue occurring in the future include:

  • Improving the testing carried out on Aditi. Part of the problem here is that, as a representative testing environment, Aditi is very much smaller and much less diverse than the main grid, and as such it is harder to test for all possible failure conditions which may occur when deploying code to the main grid.
  • Adding alarms to the deployment process so that when things do go wrong, such as a large number of object returns occurring, the process will automatically stop itself before the damage becomes widespread.
  • Altering the deployment process so that code is initially rolled out to a subset of each Release Channel, then paused for a few hours to see if there are any reports from users of unexpected or undesirable results, with the deployment only resuming if it appears nothing untoward has happened.
  • Developing an automated restore process to make the rollback of affected regions more time-efficient and less manually intensive. While this is the first time this particular problem of selective region object returns has occurred, it has prompted the Lab to look at ways and means of initiating such a process.
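
Purely as an illustration of how the alarm and staged-rollout options above might fit together, here is a minimal Python sketch. It is not Linden Lab's deployment tooling: every name in it (`staged_rollout`, `deploy`, `count_object_returns`, `review_passed`, the threshold and the subset size) is a hypothetical stand-in, and the "pause for a few hours" is reduced to a single callback.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class RolloutResult:
    deployed: List[str] = field(default_factory=list)
    halted: bool = False
    reason: str = ""


def staged_rollout(
    regions: Sequence[str],
    deploy: Callable[[str], None],               # hypothetical: pushes code to one region
    count_object_returns: Callable[[str], int],  # hypothetical: returns seen after deploying
    max_returns: int = 100,                      # assumed alarm threshold
    canary_fraction: float = 0.1,                # assumed size of the initial subset
    review_passed: Callable[[], bool] = lambda: True,  # stand-in for the pause and review
) -> RolloutResult:
    result = RolloutResult()
    total_returns = 0
    # At least one region goes into the initial subset.
    canary_size = max(1, int(len(regions) * canary_fraction)) if regions else 0

    for index, region in enumerate(regions):
        # Pause point: the initial subset is done, so check the review outcome.
        if index == canary_size and not review_passed():
            result.halted = True
            result.reason = "review after initial subset reported problems"
            return result

        deploy(region)
        result.deployed.append(region)

        # Alarm: stop once returns across deployed regions pass the threshold.
        total_returns += count_object_returns(region)
        if total_returns > max_returns:
            result.halted = True
            result.reason = f"object-return alarm: {total_returns} > {max_returns}"
            return result

    return result


if __name__ == "__main__":
    channel = [f"region-{i}" for i in range(20)]
    # Simulate a faulty build that triggers 60 object returns per region.
    outcome = staged_rollout(channel, deploy=lambda r: None,
                             count_object_returns=lambda r: 60)
    print(outcome.halted, len(outcome.deployed), outcome.reason)
```

In this sketch the alarm limits the damage to the handful of regions deployed before the threshold is crossed, and those are the only regions an automated restore would then need to roll back.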

It remains to be seen which of these ideas, and any others discussed at the Lab, are implemented. However, it should be remembered that even with the best will in the world, and given the dynamic nature of Second Life with all its user-created content and scripting, it is impossible for Linden Lab to anticipate every single error which may occur with a server deployment and provide a means of avoiding it. Even so, as a result of the LeTigre event, LL is looking to further improve how server code is both tested and deployed, and to provide the means to better flag any negative impact occurring during a deployment, so that remedial action can be determined and actioned in a more timely manner.

With thanks to Baz deSantis.