SL projects update week 41 / 1

Server Deploys

The main channel today received the same maintenance release that was deployed to BlueSteel last week (and which was used to help fix the LeTigre problems). Speaking at the Server / Simulator User Group, Simon Linden described the deployment as, “A very minor change … from one of the RCs. It’s really not any functional change but was needed to go grid-wide before another one goes into RC.”

Following week 40’s issues with LeTigre, the three primary Release Candidate channels will be getting updated as follows:

  • Magnum, which has code to help sims run better on new hardware, will be getting the same change that went to the main channel.
  • BlueSteel and LeTigre will get the same update as each other, which is described as “pretty minor but important”, being a back-end query optimisation designed to assist with database loads.

The prim accounting issue is still being worked on, which is affecting the scheduling of what will be available in deployments. Commenting in passing on how deployment packages are put together, Simon explained that, as a rule, bug fixes tend to be bundled together as far as possible prior to going to QA and then to an RC channel. He went on to say that what goes into other releases can be variable, “There’s a judgement call on how much gets bundled together, and a bunch of things go into the decision, like how overloaded the QA guys are, how many other things are trying to get to RC, risks of one part blocking it (like now), stuff like that”. In the meantime, LL are actively looking at ways both to prevent a recurrence of this problem and to improve the RC channel deployments as a whole.

SL Viewer

The promised new beta release (3.4.1.265642) reached the release point on Monday October 8th. This release sees tcmalloc disabled once more, but otherwise appears to be the same code as the previous beta. It is intended to be a further stability testing / confirmation release, and as such will remain available for the next couple of days as Linden Lab gather data on its performance. Tcmalloc has been disabled, rather than removed, as it has apparently been useful in helping to trace issues within the viewer code, and LL wished to retain the ability to re-enable it should it be needed to help identify problems within the viewer in future.

Assuming this release proves stable, and assuming that plans outlined by Oz Linden have not undergone significant change, it should clear the way for the unblocking of various code merges that have been awaiting resolution of the stability / memory leak issue. As previously reported, the precise order in which code merges will be made / released is unclear, but Linden Lab have a significant number of updates waiting in the wings, including Steam support changes, Monty Linden’s HTTP library updates, Baker Linden’s Group Services project code, Apple OSX Mountain Lion support (including Gatekeeper compatibility), and more.

Steam updates – one of the viewer merges waiting to be released

Under the original plans for the beta viewer, project viewer code was to start merging into the viewer with the 3.4.2 release code. As there was no OpenDev meeting on Monday October 8th, it is assumed this is still the case. However, the precise order of the merges is due for discussion within the Lab this week, so a clearer indication of the order may be available by Thursday’s OpenDev meeting; if so, it will be given in part 2 of this report.

Group Services Project

Due to the problems experienced with the LeTigre deployment in week 40, Baker Linden’s Group Services code did not receive a proper deployment to a Release Candidate channel. It is not due to be released this week, but should be in an RC deployment aimed at week 42. However, as it was bundled with the LeTigre deployment which had problems, this may be delayed further while the prim accounting error is looked into.

It is currently unclear whether the delay with the RC roll-out will influence when the Group Services viewer code (currently available in a dedicated project viewer) will be merged into the development / beta branches of the SL viewer code; again, more should be known on this following Thursday’s OpenDev meeting.

Materials Processing

Work continues to progress, with little to report at this time. The feature set for the initial release has yet to be published, and the wiki page for materials processing is due for further update. Concerns were raised over one statement relating to the use of colour, to wit:

“Color: a solid color for the surface; not used if a Texture is also specified.”

The concern was that whereas it is currently possible to specify both a texture and a colour for a given object or object face, the wiki implies that under materials processing, it will become either / or. However, this appears to be an error in the wiki, and both options will remain available.

Linkability Bug

As reported in my last update, and while not strictly a project, the bug which is currently allowing prims to be linked over distances greater than 54m has been investigated, and a fix is expected to be rolled out to the RC channels in week 42.

The “LeTigre event” and seeking to safeguard deployments

As reported last week, Wednesday October 3rd saw a massive problem hit the LeTigre Release Candidate channel, impacting over 1200 regions. It most visibly manifested itself as a large number of items (including partial builds) being returned to people’s inventories, because an error in the prim accounting code caused the software to see regions as “full”. The disruption continued across the grid throughout Wednesday and into Thursday, partially because the impacted regions not only required a corrective deployment of RC code (from BlueSteel), but also had to be manually restored to their state prior to the LeTigre deployment.

Since the problem occurred, Linden Lab has been looking not only into the bug within the prim accounting software, but also at their internal processes. This covers why the error wasn’t picked up before the LeTigre deployment went ahead, what steps can be taken to curtail such a massive disruption should a similar problem ever occur, and how regions can be restored in a less manually intensive manner. Even so, sorting out a solution which fits every possible scenario by which a deployment may go wrong isn’t easy.

Speaking at the Sim / Server User Group meeting, Simon Linden commented on the matter thus, “The tough thing with SL testing is running it on all those combinations – it’s just never possible to have complete test coverage. The RC channels are actually designed to be representative of the whole grid, so we try to keep a mix of the different types of regions like full, mainland, Linden Homes, etc. On one hand, the system worked … we found a problem before it got to the whole grid, which might have been how this would have happened before we started the Release Channels. But it really was so bad we definitely want to catch something like that even earlier. So … we’ve had a bunch of meetings discussing what and why it happened, and have some better tests added to the regular test pass so this specific problem won’t happen again.”

Options apparently discussed at these meetings to prevent a similar issue occurring in the future include:

  • Improving the testing carried out on Aditi. Part of the problem here is that as a representative testing environment, Aditi is very much smaller and much less diverse than the main grid, and as such it is harder to test for all possible failure conditions which may occur when deploying code to the main grid.
  • Adding alarms to the deployment process so that when things do go wrong, such as a large number of object returns occurring, the process will automatically stop itself before the damage becomes widespread.
  • Altering the deployment process so that code is initially rolled out to a subset of each Release Channel, then paused for a few hours to see if there are any reports from users of unexpected or undesirable results, and only resumed if it appears nothing untoward has happened.
  • Looking at ways and means to initiate an automated restore process, in order to make the rollback of affected regions more time-efficient and less labour-intensive. While this is the first time this particular problem of selective region object returns has occurred, it has prompted the Lab to examine this.
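
Purely as an illustration of how the alarm and staged roll-out ideas above might fit together, here is a minimal sketch. It is not Linden Lab’s deployment tooling: every name in it (staged_rollout, deploy, count_object_returns, canary_ok) and every threshold is a hypothetical stand-in, and the “pause for a few hours” step is reduced to a single callback.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class RolloutResult:
    deployed: List[str]   # regions that received the new code
    halted: bool          # True if the roll-out stopped early
    reason: str = ""      # why it stopped, if it did


def staged_rollout(
    regions: Iterable[str],
    deploy: Callable[[str], None],                 # hypothetical: push code to one region
    count_object_returns: Callable[[str], int],    # hypothetical: returns seen after deploy
    canary_ok: Callable[[], bool] = lambda: True,  # hypothetical: "no bad reports" check
    canary_fraction: float = 0.1,                  # assumed size of the initial subset
    max_returns_per_region: int = 50,              # assumed alarm threshold
) -> RolloutResult:
    regions = list(regions)
    canary_size = max(1, int(len(regions) * canary_fraction)) if regions else 0
    deployed: List[str] = []

    for index, region in enumerate(regions):
        # Once the initial subset is done, pause and only resume if nothing
        # untoward has been reported.
        if index == canary_size and not canary_ok():
            return RolloutResult(deployed, True, "canary check failed")

        deploy(region)
        deployed.append(region)

        # Alarm: stop automatically if a region shows a flood of object returns.
        returns = count_object_returns(region)
        if returns > max_returns_per_region:
            return RolloutResult(deployed, True, f"{region}: {returns} object returns")

    return RolloutResult(deployed, False)
```

In this sketch a single region exceeding the threshold halts everything that follows, so the damage is limited to the regions already updated; a real process would presumably weigh that against the cost of false alarms.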

It remains to be seen which of these, and any other ideas discussed at the Lab, are implemented. However, it should be remembered that even with the best will in the world, and given the dynamic nature of Second Life with all its user-created content and scripting, it is impossible for Linden Lab to take into account every single possible error which may occur with a server deployment, and provide a means of avoiding it. Even so, as a result of the LeTigre event, LL are looking to further improve how server code is both tested and deployed in the future, and to provide the means to better flag any negative impact occurring during a deployment, so that remedial action can be determined and carried out in a more timely manner.

With thanks to Baz deSantis.