The “LeTigre event” and seeking to safeguard deployments

As reported last week, Wednesday October 3rd saw a massive problem hit the LeTigre Release Candidate channel, impacting over 1200 regions. It manifested most visibly as a large number of items (including partial builds) being returned to people’s inventories, because an error in the prim accounting code caused the software to see regions as “full”. Disruption continued across the grid throughout Wednesday and into Thursday, partially because the affected regions not only required a corrective deployment of RC code (from BlueSteel), but also had to be manually restored to their state prior to the LeTigre deployment.

Since the problem occurred, Linden Lab has been looking not only into the bug within the prim accounting software, but also at their internal processes: why the error wasn’t picked up prior to the LeTigre deployment going ahead, what steps can be taken to curtail such a massive disruption should a similar problem ever occur, and how regions can be restored in a less manually intensive manner. Even so, sorting out a solution which fits every possible scenario by which a deployment may go wrong isn’t easy.

Speaking at the Sim / Server User Group meeting, Simon Linden commented on the matter thus, “The tough thing with SL testing is running it on all those combinations – it’s just never possible to have complete test coverage. The RC channels are actually designed to be representative of the whole grid, so we try to keep a mix of the different types of regions like full, mainland, Linden Homes, etc. On one hand, the system worked … we found a problem before it got to the whole grid, which might have been how this would have happened before we started the Release Channels. But it really was so bad we definitely want to catch something like that even earlier. So … we’ve had a bunch of meetings discussing what and why it happened, and have some better tests added to the regular test pass so this specific problem won’t happen again.”

Options to prevent a similar issue occurring again in the future which have apparently been discussed at these meetings include:

  • Improving the testing carried out on Aditi. Part of the problem here is that as a representative testing environment, Aditi is very much smaller and much less diverse than the main grid, and as such it is harder to test for all possible failure conditions which may occur when deploying code to the main grid
  • Adding alarms to the deployment process so that when things do go wrong, such as a large number of object returns occurring, the process will automatically stop itself before the damage becomes widespread
  • Altering the deployment process so that code is initially rolled out to a subset of each Release Channel, prior to it being paused for a few hours to see if there are any reports from users of unexpected or undesirable results, and only resuming the deployments if it appears nothing untoward has happened
  • Looking at ways and means to initiate an automated restore process, in order to make the rollback of affected regions more time-efficient and less labour-intensive. This is the first time this particular problem of selective region object returns has occurred, but it has prompted the Lab to investigate the option.
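The second and third options can be pictured together as a staged rollout with an automatic stop. What follows is a minimal Python sketch under stated assumptions, and not a description of Linden Lab's actual tooling: the function name, the callbacks, the 10% first stage and the object-return threshold are all hypothetical, chosen purely for illustration.

```python
from typing import Callable, Iterable, List


def staged_deploy(
    regions: Iterable[str],
    deploy: Callable[[str], None],
    count_object_returns: Callable[[str], int],
    review_ok: Callable[[], bool] = lambda: True,
    first_stage_fraction: float = 0.1,   # assumed: 10% of the channel goes first
    max_returns_per_region: int = 50,    # assumed alarm threshold
) -> List[str]:
    """Deploy to a subset first, and halt if an alarm trips.

    Returns the regions that actually received the new code.
    """
    region_list = list(regions)
    first_stage_size = max(1, int(len(region_list) * first_stage_fraction))
    deployed: List[str] = []

    for index, region in enumerate(region_list):
        deploy(region)
        deployed.append(region)

        # Alarm: an unusual number of object returns stops the rollout.
        if count_object_returns(region) > max_returns_per_region:
            break

        # End of the first stage: a real process would pause here for a few
        # hours; this sketch simply asks a callback whether to resume.
        if index + 1 == first_stage_size and not review_ok():
            break

    return deployed
```

With four regions and an alarm tripping on the third, the function returns only the first three, leaving the rest of the channel untouched. The point of the design is that the damage is bounded by where the check sits in the loop, rather than by how quickly someone notices the problem.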

It remains to be seen which of these, and of any other ideas discussed at the Lab, are implemented. However, it should be remembered that even with the best will in the world, and given the dynamic nature of Second Life with all the user-created content and scripting, it is impossible for Linden Lab to take into account every single possible error which may occur with a server deployment, and provide a means of avoiding it. Even so, as a result of the LeTigre event, LL are looking to further improve how server code is both tested and deployed in the future, and provide the means to better flag any negative impact occurring during a deployment in order to allow remedial action to be determined and actioned in a more timely manner.

With thanks to Baz deSantis.

Lorca Linden provides data on pathfinding and simulator performance

It is fair to say that pathfinding has become one of the most controversial subjects in Second Life. While it has been the subject of a range of issues and problems, notably with vehicles, both before and after being deployed to the main grid, it has also become the most pointed-to bugaboo when people either are seeing, or believe they are seeing, simulator issues and problems.

Because of the levels of concern raised over pathfinding and its potential impact on simulator performance, Linden Lab has been carrying out comparative studies of simulator statistics recorded both before and after the pathfinding deployment. Speaking after the showing of a Designing Worlds special on pathfinding on Monday October 8th, Lorca Linden, the Pathfinding Producer at Linden Lab, gave the following high-level results of these comparisons:

Private Island simulator average sim fps:

  • Before pathfinding (Saturday, June 23, 2012): 44.43
  • After pathfinding:
    • Dynamic pathfinding NOT enabled: 44.41
    • Dynamic pathfinding enabled, NO pathfinding objects: 44.29
    • Dynamic pathfinding enabled, at least 1 pathfinding object: 44.25
    • Dynamic pathfinding enabled, at least 10 pathfinding objects: 44.70

Mainland simulator average sim fps:

  • Before pathfinding (Saturday, June 23, 2012): 44.66
  • Dynamic pathfinding NOT enabled: 44.46
  • Dynamic pathfinding enabled, NO pathfinding objects: 44.44
  • Dynamic pathfinding enabled, at least 10 pathfinding objects: 44.79

These figures should not be taken to mean there are no issues with pathfinding in terms of raised JIRAs relating to vehicles, etc. However, as high-level as they are, and allowing for the size of the sample taken, they potentially show that pathfinding is not having as heavy an impact on simulators as many fear is the case. Commenting on them, Lorca said, “So in short, while there are certainly cases in which it’s possible for PF to have some performance impact, the data is showing that in the great majority of cases PF is not causing performance harm, since the grid-wide averages are within 1% pre and post PF.”
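The “within 1%” remark can be checked directly against the figures quoted above. The short Python sketch below does only that arithmetic: the fps values are those given in the lists, while the dictionary labels and function names are my own shorthand and not anything used by the Lab.

```python
# Average sim fps figures as quoted in the lists above.
PRIVATE_BEFORE = 44.43
PRIVATE_AFTER = {
    "not enabled": 44.41,
    "enabled, no objects": 44.29,
    "enabled, 1+ objects": 44.25,
    "enabled, 10+ objects": 44.70,
}
MAINLAND_BEFORE = 44.66
MAINLAND_AFTER = {
    "not enabled": 44.46,
    "enabled, no objects": 44.44,
    "enabled, 10+ objects": 44.79,
}


def percent_change(before: float, after: float) -> float:
    """Relative change from the pre-pathfinding baseline, in percent."""
    return (after - before) / before * 100.0


def changes(before: float, after_by_case: dict) -> dict:
    """Percent change for every post-pathfinding case."""
    return {case: percent_change(before, fps) for case, fps in after_by_case.items()}


if __name__ == "__main__":
    for label, before, after in (
        ("Private Island", PRIVATE_BEFORE, PRIVATE_AFTER),
        ("Mainland", MAINLAND_BEFORE, MAINLAND_AFTER),
    ):
        for case, pct in changes(before, after).items():
            print(f"{label:15} {case:22} {pct:+.2f}%")
```

On these numbers the largest shift in either direction is roughly 0.6% (the Private Island case with at least 10 pathfinding objects, which is actually an increase), so every case sits inside the 1% band Lorca describes.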

Also discussing the issue of perceived simulator performance during the Designing Worlds show itself, Falcon Linden acknowledged that part of the concern about simulator performance has been due to the way in which the Lab presented the potential for possible impact. “We didn’t communicate early on about the optimal performance of pathfinding,” he said. “We really wanted to take a conservative approach, so our communications, I think, were almost negative, in a way, where we were telling people what the worst case was, like we were making it seem that was what we expected to happen, but it wasn’t; and so people read from that that things could get bad.”

Lorca Linden (centre left) is joined by Maestro and Falcon Linden and Sandry Logan on a Designing Worlds pathfinding special shown at the Designing Worlds studio on Monday October 8th (image courtesy of Designing Worlds / Wildstar Beaumont)

Also in the Designing Worlds programme, Lorca and Falcon, together with Maestro Linden, discuss Linden Lab’s thinking on pathfinding, why the Lab felt it to be a valuable resource to have in SL, and explain some of the additional features within it, such as the ability for pathfinding characters to navigate to a specific point in a region – or a point several regions away; and a function which can be used both with pathfinding and independently of it (for example, it could be used within a HUD which can guide avatars to a specific store within a mall).

Overall, the programme helps to provide further insight into how pathfinding works and how it can be used, with a very practical demonstration by Sandry Logan of the Virtual Kennel Club. As such, for anyone who is curious / worried / may have a use for pathfinding, it is a recommended watch. Catch it on Designing Worlds at Treet TV.

Considering SL large regions

While reading the transcript of the Simulator User Group meeting of Tuesday 25th September as a part of preparing my last SL projects update, I came across an interesting exchange on the subject of large regions – megaregions in OpenSim parlance – which gives some insight into the broad level of thinking about the platform that goes on within the Lab.

For those unfamiliar with the concept, a megaregion is essentially a number of standard 256×256 metre regions stitched together to present what appears to be a single large region. These are generally presented in terms of areas equivalent to 2×2 regions (i.e. 4 regions in total) or 3×3 (equivalent to nine regions) and so on.

The Universal Campus designed for 4-region (2×2) megaregions, created by Michael Emory Cerquoni. The arrow indicates my avatar, to demonstrate the size of the build

Megaregions have been available within the OpenSim environment for the last few years, and are seen as a means of providing far more space free from the terrors of region crossings, greatly facilitating a range of activities (flying, sailing, vehicle racing, etc.). There are some limitations with them at present, however, which can make working with them difficult: parcel media tends to be restricted to the south-west corner “region” of a megaregion, for example, and elements such as terrain textures cannot currently be easily edited.

Second Life is very much geared to the 256×256 metre region, so it was surprising to come across a discussion on large regions in SL – and to learn that Linden Lab have in fact looked at them in some detail. The revelation came in a comment from Simon Linden: “Yeah, big regions have been a pet project of mine … unfortunately it’s an incredible hassle to get right.” A short time later, he went on to say:

I’ve spent some serious time looking at large regions … it’s a huge project to do it right, involving a bunch of messaging changes to the viewer (like layer data, object positions, etc), region-to-region communication (all the neighbours) our back-end (the grid layout itself) … it touches almost everything in some way, which is why we’re where we are today 😛

Simon also indicated that he felt an ideal size for large regions – were they ever to happen – would be 1 km on a side, rather than 1024m on a side (as would be the case if large regions were somehow “scaled up” from the current region size, as with OpenSim). However, this would mean breaking away from the current power of 2 approach to building Second Life, and might lead to position translation issues (as in translating the position known in one region to the relative position in a neighbouring region), although Andrew Linden felt this might actually be easier to handle in 1k blocks between neighbouring regions, rather than relying on power of 2. When asked what would happen to the 24 metres per side which would be lost in scaling to 1000×1000m, rather than 1024×1024m, Andrew suggested (semi-jokingly) that they’d be lost “To … boundary conditions.”
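To give a feel for the position translation question, here is a small Python sketch. It assumes, purely for illustration, that a position along one axis is held as a whole number of metres from a grid origin and is split into a region index plus an offset within that region. This is a simplified model and not a description of how the simulator or viewer actually stores positions, and both function names are hypothetical.

```python
from typing import Tuple


def split_pow2(global_m: int, shift: int = 8) -> Tuple[int, int]:
    """Power-of-2 region size (2**8 = 256 m by default): shift and mask."""
    size = 1 << shift
    return global_m >> shift, global_m & (size - 1)


def split_any(global_m: float, region_size_m: float) -> Tuple[int, float]:
    """Any region size, such as 1000 m: integer division and remainder."""
    index, local = divmod(global_m, region_size_m)
    return int(index), local


if __name__ == "__main__":
    print(split_pow2(300))        # 256 m regions
    print(split_pow2(1500, 10))   # 1024 m "scaled up" regions
    print(split_any(1500, 1000))  # 1 km regions
```

A point 1500m along an axis falls 476m into the second region when regions are 1024m on a side, but 500m into it when they are 1000m on a side. The power of 2 version gets its answer with a bit shift and a mask, which is the habit a 1 km region would break; the general version is no harder to write, which is perhaps the sense in which 1k blocks might be the easier thing to reason about.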

Large regions in SL would offer much to the sailing, flying, role-play and racing communities, were they possible

Were any change in region sizes to be undertaken, they would not be limited to just the simulator / server-side of things. The viewer itself is predicated on the power of 2 approach, being specifically geared to handling regions of 256m on a side (hence why megaregions in OpenSim have some limitations in terms of editing, etc.). So for large regions to work properly, it is likely that substantial changes would have to be made to the viewer – which even with the best will in the world, isn’t something which is going to happen any time soon, even were LL pondering looking beyond the theoreticals of large regions.

Nevertheless, the fact that the matter has been – and might still be – something some in Linden Lab are actively looking at, even at only the conceptual level, is interesting, and does tend to demonstrate that LL do think about the platform in somewhat radical ways.

Second Life RC channel server deploys cancelled

The server deploys planned for Wednesday September 26th have been cancelled. The news was given in a brief update to the Server Deploys blog post, which simply read:

UPDATE: There were blocking bugs found in both the RC’s planned for release this week. There will be no releases Wednesday morning. There will be no rolling restarts.

Oskar Linden also added a comment:

We found blocking issues during our pre-RC smoke tests. These issues will block the Wednesday morning RC releases. Regions will not be restarted.

Classified as maintenance releases, the deploys were to have included back-end configuration work designed to help SL run better on new and future hardware, and Baker Linden’s new Group Services code.

The postponement is the second time RC deploys have been cancelled in the last two weeks, with those planned for the week commencing 17th September being cancelled as a result of failing to pass QA testing.

As a result of last week’s RC cancellation, there was no main channel deploy on Tuesday 25th September. While the RC channel deploys might be rescheduled for later this week, depending on the severity of the reason for them being cancelled in the first place, if they do not take place then it is probable that there will be no main channel roll-out again next week.

JIRA: feedback from the Lab

The dust is slowly settling from the recent announcement regarding the effective closure of the Public JIRA for bug and issue reporting and the implementation of the simplified Bug Tracker approach and associated changes.

Comments passed by front-line staff at Linden Lab make it reasonably clear that the new approach to bug reporting and management has impacted more than just those users who have in the past been actively and positively engaged in the Lab’s JIRA; the Lab itself is undergoing something of a shift in how issues are handled, and it is likely to be a few weeks before matters settle down internally.

JIRA change: seen as a disappointing move by many

The Lab is also adamant that the overall aim of the change is to try to improve the utility of the bug reporting and management process from their own perspective – part of which was to eliminate the use of the JIRA either as a forum for discussion and / or for posting irrelevant / angry statements, neither of which was seen as assisting the process of problem management and issue resolution. However, there has been an acknowledgement in some quarters that whether the new system will increase or decrease the effectiveness of bug tracking / management over time is an open question at the Lab, and that depending upon how the new system is seen to work over the next weeks / months, further changes may be made.

“JIRA Support Groups”

During the TPV/Dev meeting on September 7th, Oz Linden indicated that there are two “user groups” which are being established in relation to the new changes, and which the Lab will use to allow those residents with a demonstrable need to access a JIRA system and who are known to do so “responsibly” to have greater access to the new system.

Commenting after it became apparent during the meeting that some in attendance already had greater access to LL’s JIRA than others (including the ability to still comment on JIRA items), Oz said:

It should be noted that not all of you have exactly the same privileges. As part of this change [to the JIRA system] I created some access groups that do have somewhat deeper access … I haven’t actually figured out exactly what got set-up in the end … so be a little careful about asserting that, “Anyone can do such-and-such”, because if you’re in the active contributors’ group or the support helpers’ group, you have privileges other people don’t have … As I said, these changes have only been in effect less than 24 hours now [at the time of the meeting] … because there are a couple of levels of indirection involved, it’s not trivial to figure out what privileges a given person has – which is weird, but there you go … So, I have put in place a mechanism that I hope will make it easier for those of you who are actively collaborating with us on making the world better to continue doing so. It will probably take some time for all the bugs in that accommodation to be worked out.

Later in the meeting, he indicated that one of these two groups, the “active contributors’ group”, is aimed at the likes of TPV developers and those who have contributed to Second Life in terms of code and fixes, etc., in order to try to ensure they continue to have access to the new system which is beneficial to them (and more particularly, to LL) in order to better resolve bugs.

Similarly the “support helpers’ group” will be overseen by Alexa Linden and will comprise those who have demonstrated their value in assisting with the broader triage process (such as identifying duplicate issues, recognising where short-term workarounds for problems may exist, etc.).

Both groups were referred to as having greater ability to search reports in the new system, although the precise function and capabilities of these groups are liable to mature alongside the new system. While some people have already been added to the groups, this has been done as something of a “first pass” and appears to have been based upon first-hand knowledge of those involved. How additional people will be added to each of the groups is not entirely clear, although it is evident that in order to qualify for consideration, an individual must have a track record of positive and beneficial engagement in the JIRA process to triage and / or resolve issues.

Also during the meeting, Oz encouraged TPV developers who are concerned about the negative impact of the change and who have “Legitimate use cases that serve the needs of Second Life in general and Linden Lab in particular,” which may not be met by the new system, to write them up “In non-emotive form, … [but] in terms of how they are useful to Second Life residents and how they provide utility to Linden Lab … a calm exposition of the value to Linden Lab of doing something different would be.”

Forum Discussion Option

The JIRA situation was also raised at the Simulator User Group meeting, also held on September 7th, where Simon Linden put forward a suggestion that perhaps the forums could be used in some capacity. He was encouraged by those attending the meeting to pass the idea back to the Lab itself, with Toysoldier Thor suggesting a new Forum category of “Post-JIRA Forums” to facilitate general discussions. During the Content Creation User Group meeting held on the 10th September, Alexa Linden further indicated that the possible use of the forums was being considered.

Going Forward

The debate on the positive / negative aspects of this change is liable to continue for some time to come. That steps were taken to create two new “JIRA support groups” ahead of the launch of the new system tends to demonstrate that some within LL were not blind to the part played by users in the overall management and resolution of bugs. The hope appears to be that these new groups will offset the more negative aspects (lack of access, ability to contribute, etc.) presented with the launch of the new system.

Whether this proves to be the case will come down to how effectively the groups are managed, the level of access those within the groups are given, and whether or not the new system itself achieves the level of improved utility in the reporting, triaging and resolution of bugs the Lab hopes will be the case. Currently, it would appear that none of this is liable to be objectively known for the next several months.

Linden Lab close public JIRA, launch Bug Tracker

Linden Lab today reported that they’ve effectively closed the Public JIRA system to users, and are launching a new “bug reporting project”.

The announcement, made in the Technology blog, reads:

User-submitted bug reports help improve the Second Life experience for all Residents, so we greatly appreciate all of you who take the time to provide this invaluable information to us. 

Because we want to make it even easier to report bugs, today we are making some changes that will streamline the bug reporting process, allowing us to more quickly collect information and respond to issues.

Following is a summary of the JIRA changes:

  • All bugs should now be filed in the new BUG project, using the more streamlined submission form.
  • Second Life users will only see their own reported issues.  When a Bug reaches the “Been Triaged” status, they will no longer be able to add comments to their issue.
  • Once a Bug reaches the “Accepted” or “Closed” status, it will not be updated. You can watch the Release Notes to see when and if a fix has been released for your issue.
  • Existing JIRAs will remain publicly visible. We will continue to review and work through these.

To those of you who have taken the time to alert us to bugs and provided the information we need to fix them — thank you! We hope that you will continue to help us improve Second Life, and this new process should make it easier for all of us. Ideas about how we can continue to improve the bug reporting process can be shared here.

For more information, visit:
How to report a BUG (Knowledge Base Article)
Bug Tracker (wiki page)
Bug Tracker Status/Resolutions (wiki page)

As a part of this change the public JIRA is still browsable, but it appears the ability to comment on specific JIRA items has been turned off.

It’s hard to fathom why this has been done – and the stated reason actually makes little sense. If nothing else, the fact that users can only see the bugs they report will inevitably mean that the system is liable to get flooded with duplicate entries – far more so than was the case with the JIRA system. Beyond this are other aspects which seem to make this move counter-productive:

  • Users are often a part of the triage process. They can confirm when and how issues are occurring; they can test different hardware and different viewer options and ascertain if the problem is at all localised, or possibly an artefact unique to the reporter’s system
  • Developers can similarly – and vastly – help the triage / resolution process, bringing their own knowledge and skills to bear on user-reported problems
  • Both users and TPV developers can speed the process on duplicate JIRA identification and cross-referencing, reducing the amount of work LL have to initially undertake.

All this move appears to do is further break another means of productive collaboration between Linden Lab and TPV developers / the user community, leaving everyone the worse off, and that in itself is hardly positive.

While there has been frustration within LL – and among those who do invest time and effort in trying to help LL deal with raised JIRAs – over the amount of (often pointless) feedback and bickering that can occur with a particularly emotive JIRA (comments like THIS IS BAD!!!!!!! FIX IT NOW!!!!!!! certainly don’t help anyone), this move can hardly be called a proportional response to preventing such problems.

Unless there is more to come, such as TPVs at least being allowed to engage in the bug / issue reporting / triage / resolution process, there is potentially only one adjective which some might opt to apply to this move.

Asinine.