Piling it on: the network optimisation tests

Thursday October 11th saw a huge response to Oskar Linden’s request for assistance with network optimisation tests, with many people logging-in to Aditi to join his Beta User Group meeting (I actually made it for the first time myself; the time of his meeting is generally a little awkward for me). More were available on the IRC channel established for the test as well.

Oskar’s meeting place at Morris on Aditi.

Things got off to a rocky start; mid-way through the Beta UG meeting everyone received the royal order of the boot, and problems occurred attempting to log back in. It transpired that an SSL certificate had expired at LL’s end and had to be renewed (through until 2015). Even so, not everyone appeared to make it back (or at least, not with their primary accounts!). Maestro Linden did make it back with the rest of us, and immediately sought protection in a state-of-the-art anti-crash system created (or is that crated?) by Ordinal Malaprop*. No amount of coaxing could get him out, either:

Darien Caldwell: you can come out of the box now Maestro. Crash is over ;p

Mæstro Linden (maestro.linden): I’ll come out when I feel safe 🙂

People getting back after the crash and finding we have a Maestro Linden-in-the-box

The meeting also had some disruption from an unhappy camper or two complaining about bans. One of them made it back following the initial forced log-out, and as final preparations were made for the test, appeared to successfully crash the region. Shortly before this happened, concerns were raised that this individual may have been trying to disrupt the IRC test channel, as they appeared to be passing commands aimed at IRC in local chat (at one point a little later, a similar command appeared in the IRC channel, and I and a number of others were, coincidentally or otherwise, disconnected).

The testing itself proceeded pretty much as planned, with everyone logging-in to a specified region at more-or-less the same time, testing the network capabilities in handling a large number of log-in updates in a single region. From my perspective, this went well, and as one of the initial people to log-in, I didn’t appear to suffer from the kind of lag usually associated with moving around in a region where there are a large number of people arriving.

Following the en-masse arrival, we dispersed to two regions for a group chat load test. I cannot actually say how this went, as I arrived at my designated region, only to take three steps and crash (an issue at my end of the SL equation rather than anything else).

I made it back to participate in the IM tests, which comprised piling-on a mass of IMs to targeted avatars and then awaiting their reply. I think I was one of the first to IM and one of the last to get a reply. Again, not through any failure in the system, but simply because Pey’s Law affected the tester I was IMing – he replied on receiving my first message, but forgot to press ENTER to send :).

The final part of the test was a mass teleport to a specified region, again presumably to test how the network handled a large number of arrivals within a region. While this may have been a placebo effect from being on Aditi, the teleport itself seemed to me to be somewhat faster than is usual, with the progress bar merely flicking up on the screen and then vanishing as I arrived. Once there, I also found walking around with people teleporting-in did not seem to be as prone to mini-freezes or stutters as can be the case. However, the load on the target region (selected at the last minute due to problems with the intended destination) may have been lighter than hoped, as it had a cap of 21 on the number of avatars allowed into it, and a number of people did report they were unable to teleport-in as a result (there were probably around 40-50 listed on the IRC chat page).

Pile-on Test Medal

Overall the tests made for a fun social gathering, with a lot of good humour all around, and Oskar and his team apparently gathering the data they wanted.

Hopefully, there will be further follow-up on the overall intent of the tests and the results in an upcoming Sim / Server UG meeting. Oskar certainly appeared pleased with the outcome, and was on the main grid after the tests to hand out medals to the participants (providing they knew the sekrit password! 🙂 ).

 * With thanks to DD Ra for pointing this out; I missed checking the creator details earlier.

Network Optimisation: LL seek assistance

Linden Lab are in the process of changing the manner in which network traffic is handled within Second Life, and require assistance in testing the changes made to date in order to ensure various services are functioning correctly prior to rolling-out the changes to the grid as a whole.

To this end, Oskar Linden posted the following request in the Server forum late on Wednesday 10th October:

Linden Lab has made some changes to the way regions handle network traffic. We need your help to test. This will help us insure that regions are communicating appropriately over the network. Mainly we are concerned with agents entering via direct login, teleport, and region crossing. As well as other functions such as IMs, Voice, and Group chat.

Tomorrow, October 11th, at 4PM Pacific time (right after the server beta user group) we will conduct these tests with as many people as we have. Testing will take place on ADITI and require out of world communication we will be coordinating via IRC. This will require an IRC client connected to EFNET in the channel #sltest. If you don’t know how to do that you have until tomorrow afternoon to figure it out. 🙂

The details on the tests are here:

 – https://wiki.secondlife.com/wiki/Networking_Optimization_Pile_On_Tests

I hope you can come and help us test these new changes.

__Oskar

The tests will comprise three parts:

  • Direct log-in test to a pre-determined location on the beta grid.
  • A group chat test via the Second Life Beta in-world group.
  • A teleport test to a pre-determined location.

Note that Voice is also indicated as one of the services to be tested, but no details on what this will entail have as yet been included in the test notes – please check both the note and Oskar’s forum thread for possible updates on this ahead of testing.

Those wishing to take part in the tests will need to:

  • Be members of the Second Life Beta in-world group
  • Be able to log-in to the Aditi beta grid

The tests will be coordinated on IRC using the EFnet channel #sltest, and those involved in the tests will need to be able to access this channel either via the EFnet website or through an IRC client.

Accessing the #sltest channel on the EFnet website – note you do not require a registered account; you can access the channel using any suitable nickname

The tests are due to commence at 16:00 SLT.

Note that these changes are not related to the region lag issue of sudden and massive lag spikes, as reported in the Server forum threads, but rather appear to be part of ongoing network-related work.

The “LeTigre event” and seeking to safeguard deployments

As reported last week, Wednesday October 3rd saw a massive problem hit the LeTigre Release Candidate channel, which impacted over 1200 regions. This most visibly manifested itself as a large number of items (including partial builds) being returned to people’s inventories, because an error in the prim accounting code caused the software to see regions as “full”. This saw disruption across the grid throughout Wednesday and into Thursday, partially because those regions impacted by the error not only required a corrective deployment of RC code (from BlueSteel), but also had to be manually restored to a state prior to the LeTigre deployment occurring.

Since the problem occurred, Linden Lab has not only been looking into the bug within the prim accounting software, but also at their internal processes in terms of why the error wasn’t picked-up prior to the LeTigre deployment going ahead, and also in terms of what steps can be taken to curtail such a massive disruption in the future, should ever a similar problem occur, and how regions can be restored in a less manually intensive manner. Even so, sorting out a solution which fits every possible scenario by which a deployment may go wrong isn’t easy.

Speaking at the Sim  / Server User Group meeting, Simon Linden commented on the matter thus, “The tough thing with SL testing is running it on all those combinations – it’s just never possible to have complete test coverage. The RC channels are actually designed to be representative of the whole grid, so we try to keep a mix of the different types of regions like full, mainland, Linden Homes, etc. On one hand, the system worked … we found a problem before it got to the whole grid, which might have been how this would have happened before we started the Release Channels. But it really was so bad we definitely want to catch something like that even earlier. So … we’ve had a bunch of meetings discussing what and why it happened, and have some better tests added to the regular test pass so this specific problem won’t happen again.”

Options to prevent a similar issue occurring again in the future which have apparently been discussed at these meetings include:

  • Improving the testing carried on Aditi. Part of the problem here is that as a representative testing environment, Aditi is very much smaller and much less diverse than the main grid, and as such it is harder to test for all possible failure conditions which may occur when deploying code to the main grid
  • Adding alarms to the deployment process so that when things do go wrong, such as a large number of object returns occurring, the process will automatically stop itself before the damage becomes widespread
  • Altering the deployment process so that code is initially rolled out to a subset of each Release Channel, prior to it being paused for a few hours to see if there are any reports from users of unexpected or undesirable results, and only resuming the deployments if it appears nothing untoward has happened
  • While it is the first time this particular problem has occurred in terms of selective region object returns, it has prompted the Lab into looking at ways and means to initiate an automated restore process in order to make the rollback of affected regions more time-efficient and less intensive.
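To make the alarm and staged-deployment ideas in the list above a little more concrete, here is a minimal sketch of how such a process might behave. It is purely illustrative and is not how Linden Lab's deployment tooling works: the names `staged_rollout`, `deploy`, `count_returns` and `confirm_resume`, along with the 10% subset size and the threshold of 50 object returns, are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class RolloutResult:
    deployed: List[str] = field(default_factory=list)
    halted: bool = False
    reason: str = ""


def staged_rollout(
    regions: Sequence[str],
    deploy: Callable[[str], None],          # hypothetical: pushes code to one region
    count_returns: Callable[[str], int],    # hypothetical: object returns seen after deploy
    canary_fraction: float = 0.1,           # assumed size of the initial subset
    max_returns: int = 50,                  # assumed alarm threshold
    confirm_resume: Callable[[], bool] = lambda: True,  # stands in for the pause and review
) -> RolloutResult:
    """Deploy to a subset first, pause for review, and stop if the alarm trips."""
    result = RolloutResult()
    if not regions:
        return result

    canary_count = max(1, round(len(regions) * canary_fraction))
    total_returns = 0

    for index, region in enumerate(regions):
        # Once the initial subset is done, wait for a go / no-go decision.
        if index == canary_count and not confirm_resume():
            result.halted = True
            result.reason = "paused after initial subset"
            return result

        deploy(region)
        result.deployed.append(region)

        # Alarm: halt automatically if too many objects have been returned.
        total_returns += count_returns(region)
        if total_returns > max_returns:
            result.halted = True
            result.reason = f"alarm: {total_returns} object returns"
            return result

    return result
```

In this sketch, a deployment in which each region reports 30 object returns would stop after the second region, long before reaching the whole channel. The real difficulty, as noted above, lies in knowing which signals to watch for in the first place.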

It remains to be seen which of these – and any other ideas which have been discussed at the Lab – are implemented. However, it should be remembered that even with the best will in the world, and given the dynamic nature of Second Life with all the user-created content and scripting, it is impossible for Linden Lab to take into account every single possible error which may occur with a server deployment, and provide a means of avoiding it. Even so, as a result of the LeTigre event, LL are looking to further improve how server code is both tested and deployed in the future, and provide the means to better flag any negative impact occurring during a deployment in order to allow remedial action to be determined and actioned in a more timely manner.

With thanks to Baz deSantis.

Lorca Linden provides data on pathfinding and simulator performance

It is fair to say that pathfinding has become one of the most controversial subjects in Second Life. While it has been the subject of a range of issues and problems, noticeably with vehicles, both before and after being deployed to the main grid, it has also become the most pointed-to bug-a-boo when people either are seeing, or believe they are seeing, simulator issues and problems.

Because of the levels of concern raised over pathfinding and its potential impact on simulator performance, Linden Lab has been carrying out comparative studies of simulator statistics recorded both before and after the pathfinding deployment. Speaking after the showing of a Designing Worlds special on pathfinding on Monday October 8th, Lorca Linden, the Pathfinding Producer at Linden Lab, gave the following high-level results of these comparisons:

Private Island simulator average sim fps:

  • Before pathfinding (Saturday, June 23, 2012): 44.43
  • After pathfinding:
    • Dynamic pathfinding NOT enabled: 44.41
    • Dynamic pathfinding enabled, NO pathfinding objects: 44.29
    • Dynamic pathfinding enabled, at least 1 pathfinding object: 44.25
    • Dynamic pathfinding enabled, at least 10 pathfinding objects: 44.70

Mainland simulator average sim fps:

  • Before pathfinding (Saturday, June 23, 2012): 44.66
  • Dynamic pathfinding NOT enabled: 44.46
  • Dynamic pathfinding enabled, NO pathfinding objects: 44.44
  • Dynamic pathfinding enabled, at least 10 pathfinding objects: 44.79

These figures should not be taken to mean there are no issues with pathfinding in terms of raised JIRAs relating to vehicles, etc. However, as high-level as they are, and allowing for the size of the sample taken, they potentially show that pathfinding is not having as heavy an impact on simulators as many fear is the case. Commenting on them, Lorca said, “So in short, while there are certainly cases in which it’s possible for PF to have some performance impact, the data is showing that in the great majority of cases PF is not causing performance harm, since the grid-wide averages are within 1% pre and post PF.”
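For anyone wanting to check the “within 1%” remark against the figures quoted above, the arithmetic is simple enough to run. The short sketch below uses only the averages given in this article; the function and variable names are my own, and it says nothing about how the Lab gathered or processed its data.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from the pre-pathfinding average, in percent."""
    return (after - before) / before * 100.0


# Average sim fps figures as quoted above.
PRIVATE_ISLAND_BEFORE = 44.43
PRIVATE_ISLAND_AFTER = {
    "dynamic pathfinding not enabled": 44.41,
    "enabled, no pathfinding objects": 44.29,
    "enabled, at least 1 pathfinding object": 44.25,
    "enabled, at least 10 pathfinding objects": 44.70,
}

MAINLAND_BEFORE = 44.66
MAINLAND_AFTER = {
    "dynamic pathfinding not enabled": 44.46,
    "enabled, no pathfinding objects": 44.44,
    "enabled, at least 10 pathfinding objects": 44.79,
}


def changes_vs_baseline(before: float, after: dict) -> dict:
    """Map each post-pathfinding case to its percentage change."""
    return {case: percent_change(before, fps) for case, fps in after.items()}


if __name__ == "__main__":
    for label, before, after in (
        ("Private Island", PRIVATE_ISLAND_BEFORE, PRIVATE_ISLAND_AFTER),
        ("Mainland", MAINLAND_BEFORE, MAINLAND_AFTER),
    ):
        for case, change in changes_vs_baseline(before, after).items():
            print(f"{label}: {case}: {change:+.2f}%")
```

The largest shift in either direction is the rise of roughly 0.6% for Private Island regions with at least 10 pathfinding objects, so all seven cases do sit within 1% of their pre-pathfinding averages.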

Also discussing the issue of perceived simulator performance during the Designing Worlds show itself, Falcon Linden acknowledged that part of the concerns about simulator performance have been due to the way in which the Lab presented the potential for possible impact. “We didn’t communicate early on about the optimal performance of pathfinding,” he said. “We really wanted to take a conservative approach, so our communications, I think, were almost negative, in a way, where we were telling people what the worst case was, like we were making it seem that was what we expected to happen, but it wasn’t; and so people read from that that things could get bad.”

Lorca Linden (centre left) is joined by Maestro and Falcon Linden and Sandry Logan on a  Designing Worlds pathfinding special shown at the Designing Worlds studio on Monday October 8th (image courtesy of Designing Worlds / Wildstar Beaumont)

Also in the Designing Worlds programme, Lorca and Falcon, together with Maestro Linden, discuss Linden Lab’s thinking on pathfinding, why the Lab felt it to be a valuable resource to have in SL, and explain some of the additional features within it, such as the ability for pathfinding characters to navigate to a specific point in a region – or a point several regions away;  and a function which can be used both with pathfinding and independently of it (for example, it could be used within a HUD which can guide avatars to a specific store within a mall).

Overall, the programme helps to provide further insight into how pathfinding works and how it can be used, with a very practical demonstration by Sandry Logan of the Virtual Kennel Club. As such, for anyone who is curious / worried / may have a use for pathfinding, it is a recommended watch. Catch it on Designing Worlds at Tweet TV.

Considering SL large regions

While reading the transcript of the Simulator User Group meeting of Tuesday 25th September as a part of preparing my last SL projects update, I came across an interesting exchange on the subject of large regions –  megaregions in OpenSim parlance – which gives some insight into the broad level of thinking about the platform that goes on within the Lab.

For those unfamiliar with the concept, a megaregion is essentially a number of standard 256×256 metre regions stitched together to present what appears to be a single large region. These are generally presented in terms of areas equivalent to 2×2 regions (i.e. 4 regions in total) or 3×3 (equivalent to nine regions) and so on.

The Universal Campus designed for 4-region (2×2) megaregions, created by Michael Emory Cerquoni. The arrow indicates my avatar, to demonstrate the size of the build

Megaregions have been available within the OpenSim environment for the last few years, and are seen as a means of providing far more space free from the terrors of region crossings, greatly facilitating a range of activities – flying, sailing, vehicle racing, etc. – although there are some limitations with them at present, which can make working with them difficult (parcel media tends to be restricted to the South-west corner “region” of a megaregion, for example, and elements such as terrain textures cannot currently be easily edited).

Second Life is very much geared to the 256×256 metre region, so it was surprising to come across a discussion on large regions in SL – and to learn that Linden Lab have in fact looked at them in some detail. The revelation came in a comment from Simon Linden, “Yeah, big regions have been a pet project of mine … unfortunately it’s an incredible hassle to get right,” he goes on to say, a short time later:

I’ve spent some serious time looking at large regions … it’s a huge project to do it right, involving a bunch of messaging changes to the viewer (like layer data, object positions, etc), region-to-region communication (all the neighbours) our back-end (the grid layout itself) … it touches almost everything in some way, which is why we’re where we are today 😛

Simon also indicated that he felt an ideal size for large regions – were they ever to happen – would be to a scale of 1 km on a side, rather than 1024m on a side (as would be the case if large regions were somehow “scaled up” from the current region size, as with OpenSim). However, this would mean breaking away from the current power of 2 approach to building Second Life, and might lead to position translation issues (as in translating the position known in one region to the relative position in a neighbouring region), although Andrew Linden felt it might actually be easier to handle this in 1k blocks between neighbouring regions, rather than relying on power of 2. When asked as to what would happen to the 24 metres per side which would be lost in scaling to 1000x1000m, rather than 1024×1024, Andrew suggested (semi-jokingly) that they’d be lost “To … boundary conditions.”
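As a rough illustration of the position translation being discussed, the sketch below converts a position measured from a common origin into a region index and a position local to that region, for the three region sizes mentioned above. It assumes a simple flat grid of equally sized square regions and is in no way a representation of the simulator or viewer code; `to_region_local` is a hypothetical helper invented for the example.

```python
from typing import Tuple


def to_region_local(
    global_x: float, global_y: float, region_size: float
) -> Tuple[Tuple[int, int], Tuple[float, float]]:
    """Split a global position into (region index, position within that region).

    Assumes square regions of `region_size` metres on a side, laid out on a
    flat grid starting at a shared origin (an assumption for this sketch).
    """
    region_x, local_x = divmod(global_x, region_size)
    region_y, local_y = divmod(global_y, region_size)
    return (int(region_x), int(region_y)), (local_x, local_y)


if __name__ == "__main__":
    position = (1030.0, 10.0)  # metres from the shared origin
    for size in (256, 1024, 1000):
        index, local = to_region_local(*position, size)
        print(f"{size}m regions: region {index}, local position {local}")

    # Four standard regions on a side versus a 1 km side:
    print(4 * 256 - 1000, "metres per side lost")
```

The same point, 1030m east of the origin, works out as 6m into the fifth region with 256m regions, 6m into the second with 1024m regions, and 30m into the second with 1000m regions. The translation is the same calculation whatever the region size, which may be why Andrew felt that 1k blocks would not necessarily be harder to handle.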

Large regions in SL would offer much to the sailing, flying, role-play and racing communities, were they possible

Were any changes in region sizes to be undertaken, they would not be limited to just the simulator / server-side of things. The viewer itself is predicated on the power of 2 approach, being specifically geared to handling regions of 256m on a side (hence why megaregions in OpenSim have some limitations in terms of editing, etc.). So for large regions to work properly, it is likely that substantial changes would have to be made to the viewer – which even with the best will in the world, isn’t something which is going to happen any time soon, even were LL pondering looking beyond the theoreticals of large regions.

Nevertheless, the fact that the matter has been – and might still be – something some in Linden Lab are actively looking at, even at only the conceptual level, is interesting, and does tend to demonstrate that LL do think about the platform in somewhat radical ways.

Second Life RC channel server deploys cancelled

The server deploys planned for Wednesday September 26th have been cancelled. The news was given in a brief update to the Server Deploys blog post, which simply read:

UPDATE: There were blocking bugs found in both the RC’s planned for release this week. There will be no releases Wednesday morning. There will be no rolling restarts.

Oskar Linden also added a comment:

We found blocking issues during our pre-RC smoke tests. These issues will block the Wednesday morning RC releases. Regions will not be restarted.

Classified as maintenance releases, the deploys were to have included back-end configuration work designed to help SL run better on new and future hardware, and Baker Linden’s new Group Services code.

The postponement is the second time RC deploys have been cancelled in the last two weeks, with those planned for the week commencing 17th September being cancelled as a result of failing to pass QA testing.

As a result of last week’s RC cancellation, there was no main channel deploy on Tuesday 25th September. While the RC channel deploys might be rescheduled for later this week, depending on the severity of the reason for them being cancelled in the first place, if they do not take place then it is probable that there will be no main channel roll-out again next week.