SL projects update week 41 / 1

Server Deploys

The main channel today received the same maintenance release deployed to BlueSteel last week (which was also used to help fix the LeTigre problems). Speaking at the Server / Simulator User Group, Simon Linden described the deployment as, “A very minor change … from one of the RCs. It’s really not any functional change but was needed to go grid-wide before another one goes into RC.”

Following week 40’s issues with LeTigre, the three primary Release Candidate channels will be getting updated as follows:

  • Magnum, which has code to help sims run better on new hardware, will be getting the same change that went to the main channel
  • BlueSteel and LeTigre will get the same update, which is described as “pretty minor but important”, being a back-end query optimisation designed to assist with database loads.

Currently, the prim accounting issue is still being worked on, which is affecting the scheduling of what will be available in deployments. Commenting in passing on how deployment packages are put together, Simon explained that, as a rule, bug fixes tend to be bundled together as far as possible prior to going to QA and then to an RC channel. He went on to say that what goes into other releases can be variable, “There’s a judgement call on how much gets bundled together, and a bunch of things go into the decision, like how overloaded the QA guys are, how many other things are trying to get to RC, risks of one part blocking it (like now), stuff like that”. In the meantime, LL are actively looking at ways to both prevent a recurrence of this problem and to improve the RC channel deployments as a whole.

SL Viewer

The promised new beta release (3.4.1.265642) reached the release point on Monday October 8th. This release sees tcmalloc disabled once more, but otherwise appears to be the same code as the previous beta. It is intended to be a further stability testing / confirmation release, and as such will remain available for the next couple of days as Linden Lab gather data on its performance. Tcmalloc has been disabled, rather than removed, as it has apparently been useful in helping to trace issues within the viewer code, and LL wished to retain the ability to re-enable it should it be needed to help identify problems within the viewer in future.

Assuming this release proves stable, and assuming that plans outlined by Oz Linden have not undergone significant change, it should clear the way for the unblocking of various code merges that have been awaiting resolution of the stability / memory leak issue. As previously reported, the precise order in which code merges will be made / released is unclear, but Linden Lab have a significant number of updates waiting in the wings, including Steam support changes, Monty Linden’s HTTP library updates, Baker Linden’s Group Services project code, Apple OSX Mountain Lion support (including gatekeeper compatibility), and more.

Steam updates – one of the viewer merges waiting to be released

Under the original plans for the beta viewer, project viewer code was to start merging into the viewer with the 3.4.2 release code. As there was no OpenDev meeting on Monday 8th October, it is assumed this is still the case. However, the precise order of the merges is due for discussion this week within the Lab. A clearer indication of the order may be available by the Thursday OpenDev meeting and, if so, will be given in part 2 of this report.

Group Services Project

Due to the problems experienced with the LeTigre deployment in week 40, Baker Linden’s Group Services code did not receive a proper deployment to a Release Channel. It is not due to be released this week, but should be in an RC deployment aimed at week 42. However, as it was bundled with the LeTigre deployment which had problems, this may be delayed further while the prim accounting error is looked into.

It is currently unclear as to whether the delay with the RC roll-out will influence when the Group Services viewer code (currently available in a dedicated project viewer) will be merged into the development / beta branches of the SL viewer code; again, more should be known on this following Thursday’s OpenDev meeting.

Materials Processing

Work continues to progress, with little to report at this time. The feature set for the initial release has yet to be published, and the wiki page for materials processing is due for further update. Concerns were raised over one statement relating to the use of colour, to wit:

“Color: a solid color for the surface; not used if a Texture is also specified.”

The concern was that whereas it is currently possible to specify both a texture and a colour for a given object or object face, the wiki implies that under materials processing, it will become either / or. However, this appears to be an error in the wiki, and both options will remain available.
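To illustrate what "both a texture and a colour" means for a face, here is a minimal Python sketch. It assumes the colour acts as a multiplicative tint on each texture sample, which is the common convention for this kind of combination but is not something stated in the wiki text quoted above; the function name `shade_face` and the 0.0 to 1.0 value range are purely illustrative.

```python
from typing import Optional, Tuple

RGB = Tuple[float, float, float]  # each channel in the range 0.0 to 1.0


def shade_face(color: RGB, texel: Optional[RGB] = None) -> RGB:
    """Combine a face colour with an optional texture sample.

    Assumption: when both are given, the colour tints the texture by
    per-channel multiplication. With no texture, the solid colour is used.
    """
    if texel is None:
        return color
    red, green, blue = (c * t for c, t in zip(color, texel))
    return (red, green, blue)
```

Under this model a white colour leaves the texture unchanged, which is why specifying both need not be an either / or choice.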

Linkability Bug

As reported in my last update, and while not strictly a project, the bug which is currently allowing prims to be linked over distances greater than 54m has been investigated, and a fix is expected to be rolled-out to the RC channels for this in week 42.

 

The “LeTigre event” and seeking to safeguard deployments

As reported last week, Wednesday October 3rd saw a massive problem hit the LeTigre Release Candidate channel, which impacted over 1200 regions. This most visibly manifested itself as a large number of items (including partial builds) being returned to people’s inventories, because an error in the prim accounting code caused the software to see regions as “full”. This saw disruption across the grid throughout Wednesday and into Thursday, partially because those regions impacted by the error not only required a corrective deployment of RC code (from BlueSteel), but also had to be manually restored to a state prior to the LeTigre deployment occurring.

Since the problem occurred, Linden Lab has not only been looking into the bug within the prim accounting software, but also at their internal processes in terms of why the error wasn’t picked-up prior to the LeTigre deployment going ahead, and also in terms of what steps can be taken to curtail such a massive disruption in the future, should ever a similar problem occur, and how regions can be restored in a less manually intensive manner. Even so, sorting out a solution which fits every possible scenario by which a deployment may go wrong isn’t easy.

Speaking at the Sim / Server User Group meeting, Simon Linden commented on the matter thus, “The tough thing with SL testing is running it on all those combinations – it’s just never possible to have complete test coverage. The RC channels are actually designed to be representative of the whole grid, so we try to keep a mix of the different types of regions like full, mainland, Linden Homes, etc. On one hand, the system worked … we found a problem before it got to the whole grid, which might have been how this would have happened before we started the Release Channels. But it really was so bad we definitely want to catch something like that even earlier. So … we’ve had a bunch of meetings discussing what and why it happened, and have some better tests added to the regular test pass so this specific problem won’t happen again.”

Options to prevent a similar issue occurring again in the future which have apparently been discussed at these meetings include:

  • Improving the testing carried out on Aditi. Part of the problem here is that as a representative testing environment, Aditi is very much smaller and much less diverse than the main grid, and as such it is harder to test for all possible failure conditions which may occur when deploying code to the main grid
  • Adding alarms to the deployment process so that when things do go wrong, such as a large number of object returns occurring, the process will automatically stop itself before the damage becomes widespread
  • Altering the deployment process so that code is initially rolled out to a subset of each Release Channel, prior to it being paused for a few hours to see if there are any reports from users of unexpected or undesirable results, and only resuming the deployments if it appears nothing untoward has happened
  • While this is the first time this particular problem has occurred in terms of selective region object returns, it has prompted the Lab to look at ways and means to initiate an automated restore process in order to make the rollback of affected regions more time-efficient and less manually intensive.
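The alarm and staged roll-out ideas in the list above can be sketched in a few lines of Python. This is only an illustration of the general technique under stated assumptions, not a description of Linden Lab's tooling: `deploy` is a hypothetical hook that updates one region and reports how many objects were returned, `canary_ok` stands in for the "pause and check for user reports" step, and the subset size and threshold values are invented.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class RolloutResult:
    deployed: List[str] = field(default_factory=list)
    halted: bool = False
    reason: str = ""


def staged_rollout(
    regions: Iterable[str],
    deploy: Callable[[str], int],
    canary_size: int = 2,
    max_returns_per_region: int = 50,
    canary_ok: Callable[[], bool] = lambda: True,
) -> RolloutResult:
    """Deploy to a small subset first, then continue only if all looks well.

    `deploy(region)` is a hypothetical hook: it updates one region and
    returns the number of objects returned to owners as a side effect.
    """
    result = RolloutResult()
    for index, region in enumerate(regions):
        # Pause point: once the initial subset is done, ask whether to resume.
        if index == canary_size and not canary_ok():
            result.halted = True
            result.reason = "canary check failed"
            return result
        returned = deploy(region)
        result.deployed.append(region)
        # Alarm: an unusual number of object returns stops the roll-out.
        if returned > max_returns_per_region:
            result.halted = True
            result.reason = f"{returned} object returns in {region}"
            return result
    return result
```

The point of the design is that a failure stops the process at the first region showing the symptom, so the number of regions needing a restore is bounded by how quickly the alarm trips rather than by the size of the channel.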

It remains to be seen which of these – and any other ideas which have been discussed at the Lab – are implemented. However, it should be remembered that even with the best will in the world, and given the dynamic nature of Second Life with all the user-created content and scripting, it is impossible for Linden Lab to take into account every single possible error which may occur with a server deployment, and provide a means of avoiding it. Even so, as a result of the LeTigre event, LL are looking to further improve how server code is both tested and deployed in the future, and provide the means to better flag any negative impact occurring during a deployment in order to allow remedial action to be determined and actioned in a more timely manner.

With thanks to Baz deSantis.

SL projects update week 40 / 2

Server Deploys

As many are aware, there was a major error in this week’s LeTigre Release Channel deploy. Apparently, the root cause of the problem lay in the server-side prim account code, which Simon Linden describes as having “blown up” on the LeTigre RC channel. This resulted in a large number of items (including partial builds) being returned to people’s inventories as a result of regions being seen as “full”. The problem required a two-stage recovery:

  • LeTigre regions were rolled back to a state prior to the faulty deployment, and were then updated with the BlueSteel code also deployed on Wednesday October 3rd. This helped to determine the extent of the damage (a total of some 1200 regions)
  • The regions damaged by the land impact miscalculation were then restored to a state prior to the roll-out of the original faulty LeTigre code. These had to be restored manually, which took a considerable time

There is further post-mortem work going on to try and discover why this error did not reveal itself when the code deployed to LeTigre was being tested on Aditi, and whether there is anything specific to the regions impacted by the error which may have triggered it. Thought is now also being given to managing large scale region restorations, despite this being the first time there has been such a massive issue of this kind occurring across the grid.

Current RC plans for next week call for the same maintenance release to be made to all three RC channels, which Simon Linden describes as, “Mostly internal changes but [which] does include a minor update for the physics engine library … It’s almost all updating libraries … we’ve been using a fairly old set of compilers and such to make some of the development builds of the servers, and this brings us to more recent code.” Further details on the deploy should be available next week in the Second Life Server section of the Technology forum.

SL Viewer

As indicated in part one of this report earlier this week, problems have continued with the Beta viewer code and high crash rates. Work has been ongoing to try and locate the probable cause(s), some of which included the temporary return of tcmalloc. While not actually a cause of the crash issues, having tcmalloc disabled was affecting efforts to reproduce the problems. A beta release was made on the 3rd/4th October (3.4.1.265434), which is proving to be a lot more stable than previous versions, and which happens to have tcmalloc enabled.

The current plan is for a further beta release to be made, most likely on Monday 8th October, which should see tcmalloc turned off once more (if not removed). Should this also prove to be stable, the fixes it contains will be merged back into the development viewer code, and this will clear the way for working through the backlog of code merges for both the beta and development viewers. It may also see a further 3.4.1 release version of the viewer being made.

Among the projects awaiting merging into the development and beta viewer code are:

  • The Steam support changes, which have been available within a development viewer stream, and which are described as “mostly cosmetic”. There is apparently a version of the viewer on Steam, but it is not available for general viewing / download, and is presumably there for testing purposes
  • Monty Linden’s HTTP library (texture fetch) code
  • Baker Linden’s Group Services project code
  • Apple OSX 10.8 Mountain Lion support work, including gatekeeper compatibility
  • Bug fixes and further regionalisation work.

Previous plans for these releases called for them to be made under the 3.4.2 code base. While this wasn’t discussed at the TPV/Dev meeting, one assumes this is still the case. However, speaking at the TPV Dev meeting on Friday October 5th, Oz Linden indicated that the order, etc., in which waiting merges will be cleared hasn’t been fully defined, and will be the subject of internal conversations next week at the Lab.

Avatar Baking Project

Bake fail: a familiar problem for many

There is still no major news on this project, although work is continuing both on the viewer and on the server code.

The plan remains to provide TPV developers with access to the viewer code at least 8 weeks ahead of any initial deployment of the server-side code to an Agni release channel. This is to allow TPVs time to merge the code into their viewers and participate in ongoing testing of the new service. There is a possibility that the viewer code will be available sufficiently far in advance for TPVs to be able to use it alongside the testing on Aditi (beta grid), depending on the status of the beta grid tests and how development of the viewer code progresses.


Second Life RC channel server deploys cancelled

The server deploys planned for Wednesday September 26th have been cancelled. The news was given in a brief update to the Server Deploys blog post, which simply read:

UPDATE: There were blocking bugs found in both the RC’s planned for release this week. There will be no releases Wednesday morning. There will be no rolling restarts.

Oskar Linden also added a comment:

We found blocking issues during our pre-RC smoke tests. These issues will block the Wednesday morning RC releases. Regions will not be restarted.

Classified as maintenance releases, the deploys were to have included back-end configuration work designed to help SL run better on new and future hardware, and Baker Linden’s new Group Services code.

The postponement is the second time RC deploys have been cancelled in the last two weeks, with those planned for the week commencing 17th September being cancelled as a result of failing to pass QA testing.

As a result of last week’s RC cancellation, there was no main channel deploy on Tuesday 25th September. While the RC channel deploys might be rescheduled for later this week, depending on the severity of the reason for them being cancelled in the first place, if they do not take place then it is probable that there will be no main channel roll-out again next week.

Server roll-outs w/c 30th July

Oskar has issued a notification of the planned server roll-outs for this week. As they currently stand, the roll-out will comprise:

Main Channel: Server release 12.07.24.262437 – Tuesday 31st July

This should see a further roll-out of the LSL functions related to the Advanced Creator Tools. This release will see the addition of three new LSL functions:

These new LSL functions work with the current runtime permissions system, and are a precursor to future work with experience permissions. More information about the runtime permission is here: PERMISSION_TELEPORT.

This is a roll-out of the code deployed to LeTigre and BlueSteel last week. As with both of those channels last week, the code will be enabled on the main channel regions following the deploy (although LL retain the capability to disable it).

Magnum RC: Further Pathfinding Roll-out – Wednesday August 1st

Roll-out due to commence: 07:00 SLT

A further roll-out of the server-side pathfinding code, with fixes. Currently the wiki notes for this channel appear to be stalled on the 12.07.24.262484 release.

Note that the viewer-side pathfinding tools are now available in the latest Development Viewer.

BlueSteel RC – Wednesday August 1st

Re-start due to commence: 08:30 SLT

There are no changes to this channel. It will have the same code as the main channel.

LeTigre RC: Infrastructure Project update – Wednesday August 1st

Roll-out due to commence: 09:30 SLT

Oskar comments: “This channel will have an infrastructure project that has no intentional changes to existing behaviour. There are perhaps unintentional changes to existing behaviour. If you find some please let us know!”

SL Server roll-outs: creator tools and pathfinding

Update July 18th: The Magnum RC roll-out has been delayed until Thursday July 19th. Oskar may supply a reason on the deployment thread in the forums – keep an eye on that for updates (with thanks to Wolf Baginski).

Main Channel Release

Tuesday 17th July sees a roll-out of LSL functions related to the Advanced Creator Tools. This release will see the addition of three new LSL functions (comments taken from the release notes):

  • llAttachToAvatarTemp(integer attach_point): Follows the same convention as llAttachToAvatar, with the exception that the object will not create inventory for the user, and will disappear on detach, or disconnect. It should be noted that when an object is attached temporarily, a user cannot ‘take’ or ‘drop’ the object that is attached to them. The user is ‘automatically’ made the owner of the object. Temporary attached items cannot use the llTeleportAgent or llTeleportAgentGlobalCoords LSL functions
  • llTeleportAgent(key agent_uuid, string lm_name, vector landing_point, vector look_at_point): Teleport Agent allows the script to teleport an agent to either a local coordinate in the current region or to a remote location specified by a landmark. If the destination is local, the lm_name argument is a blank string. The landing point and look at point are respected for this call. If the destination is remote, the object must have a landmark in its inventory with the teleport agent script. lm_name refers to the name of the landmark in inventory. This function cannot be used in a script in an object attached using llAttachToAvatarTemp
  • llTeleportAgentGlobalCoords(key avatar, vector global_coordinates, vector region_coordinates, vector: Teleports an agent to region_coordinates within a region at the specified global_coordinates. The agent lands facing the position defined by look_at local coordinates. A region’s global coordinates can be retrieved using llRequestSimulatorData(region_name, DATA_SIM_POS). This function cannot be used in a script in an object attached using llAttachToAvatarTemp.

The new LSL functions work with the current runtime permissions system and are a precursor to future work with experience permissions. More information about the runtime permission is here: PERMISSION_TELEPORT.

The keen-eyed will note that these are the functions that were rolled-out to the Magnum RC channel in May, and which were subsequently abused for griefing purposes. However, Linden Lab have added a new capability to the functions – what is described as an “on / off” switch which is available only to Linden Lab personnel, and which allows the functions to be enabled / disabled (the functions were also rolled-out to the Le Tigre RC on July 11th with the “on / off” switch capability). As the release notes make clear, the functions are disabled by default in the roll-out, and will presumably remain that way until such time as the updated permissions system has been rolled-out.
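As a rough illustration of what an operator-only “on / off” switch of this kind might look like, here is a small Python sketch. Everything in it is an assumption made for the example: the class name `FunctionSwitchboard`, the `is_operator` flag, and in particular the choice to treat a call to a disabled function as a silent no-op, since the release notes quoted here do not say how a disabled function behaves.

```python
from typing import Callable, Dict


class FunctionSwitchboard:
    """Registry of script functions with an operator-only on/off switch."""

    def __init__(self) -> None:
        self._functions: Dict[str, Callable[..., object]] = {}
        self._enabled: Dict[str, bool] = {}

    def register(self, name: str, func: Callable[..., object]) -> None:
        self._functions[name] = func
        self._enabled[name] = False  # shipped disabled by default

    def set_enabled(self, name: str, enabled: bool, is_operator: bool) -> None:
        # Only operators (standing in for Lab personnel) may flip the switch.
        if not is_operator:
            raise PermissionError("only operators may toggle functions")
        if name not in self._functions:
            raise KeyError(name)
        self._enabled[name] = enabled

    def call(self, name: str, *args: object) -> object:
        # Assumption: a disabled (or unknown) function does nothing.
        if not self._enabled.get(name, False):
            return None
        return self._functions[name](*args)
```

Shipping the code with the switch off means the functions can be deployed grid-wide and tested for side effects without being usable, and turned off again quickly should they be abused.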

The release also includes three bug fixes (again, as specified in the release notes):

  • SCR-342: llTeleportAgent() does not fail gracefully when specifying an invalid landmark name
  • SVC-7966: Magnum RC, llTeleportAgent gives a wrong message
  • SVC-7987: llTeleportAgent always points in the positive Y direction on teleport.

Pathfinding release: Magnum and Le Tigre

On Wednesday 18th July, the Magnum RC will get a further roll-out of the pathfinding code and Le Tigre will apparently get the same code as well. At the time of writing, the actual release note pages on the SL wiki for Magnum and Le Tigre still reflected the releases for July 11th and the forum post announcing the release did not show any specific changes from the forum post relating to the July 11th release. Any alterations which may have been made following the difficulties some initially encountered on the Magnum RC following that roll-out are therefore hard to identify. This may change prior to the actual roll-out.
