SL project news week 43/1: Server updates and llHTTPRequest

SL Server Updates

A brief start-of-week update with an important item on the LSL llHTTPRequest function.

Tuesday October 23rd saw an update to the main channel which should have minimal impact on things. “It’s a change that should make simulators run better on our new hardware,” Simon Linden explained at the Simulator User Group meeting on the 23rd.

Wednesday 24th, as previously indicated, should see the RC channels updated as follows:

  • Magnum should receive bug fixes together with Baker Linden’s Group Services project code (the viewer side of which is still blocked)
  • LeTigre should receive further updates for the new Havok code (which presumably include fixes for the crash loop situation Maestro reported in the Server Beta UG meeting; see above)
  • BlueSteel should receive, “Some more invisible changes that should help us deal with some problems like full disks that make servers very unhappy.”

Details on the deployments are, as usual, posted in the Second Life Server section of the Technology forum.

Third-party Web Caching and llHTTPRequest

Kelly Linden indicated that the updates to LeTigre in week 42 had some library updates. One of these updates was to the cURL library which changed its behavior specifically around caching.

Until now, outgoing requests have had a Pragma: no-cache header in them, because cURL added this to all requests, and this ensured fresh data was returned. The change made to the cURL library on LeTigre means that this is no longer the case, so if the third-party web server has caching enabled, any outgoing llHTTPRequest might return previously cached results from the server, rather than fresh data.

Kelly noted that, “Systems that are most likely to be affected are those that frequently hit the exact same URL and expect the data to change. Maybe they are getting a counter or checking on something’s status, leading to problems with the likes of breedables dying, and so on.”

A workaround for this has been implemented for llHTTPRequest in the form of an HTTP_CUSTOM_HEADER flag, which enables Pragma: no-cache to be specified manually.
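To see why the missing header matters, here is a minimal sketch in Python (not LSL) of a caching web server. Everything in it is hypothetical: the CachingServer class, its fetch method and the URL are invented for illustration, and the cache is simplified so that it only honours a Pragma: no-cache request header. It is not how cURL, LL's servers or any particular web server actually work.

```python
class CachingServer:
    """Toy model of a web server that caches responses by URL (hypothetical)."""

    def __init__(self):
        self.counter = 0   # stands in for data that changes on every request
        self.cache = {}    # url -> last response body

    def fetch(self, url, headers=None):
        headers = headers or {}
        # HTTP header names are case-insensitive, so normalise before checking.
        normalised = {k.lower(): v.lower() for k, v in headers.items()}
        wants_fresh = normalised.get("pragma") == "no-cache"

        if not wants_fresh and url in self.cache:
            return self.cache[url]   # previously cached copy is returned

        self.counter += 1            # generate fresh data
        body = "count=%d" % self.counter
        self.cache[url] = body
        return body


server = CachingServer()
url = "http://example.com/status"   # hypothetical URL

first = server.fetch(url)                            # nothing cached yet
second = server.fetch(url)                           # no header: cached result
third = server.fetch(url, {"Pragma": "no-cache"})    # header: fresh result
print(first, second, third)
```

In this model, a script that repeatedly polls the same URL for a changing value is in the position of the second call, while a request that specifies the header manually is in the position of the third.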

While there have been no reports of this or similar happening so far, LL are continuing to discuss the potential impact. Kelly continued, “If anyone uses or develops systems they think *might* be affected please give them a try on LeTigre this week and let me know. Or if you know others that might, encourage them to test on LeTigre this week. Thanks.”

So, if you have created a product which uses an external web server for updates in the manner described above, etc. (or know anyone that does), you may want to test behaviours on a LeTigre region to see if your product is impacted prior to this change rolling out further across the grid.

SL Issues

Network Traffic and Sim Lag / Crash

This issue has been going on since the start of the month, and appears to affect regions with large numbers of people.

A bug report was raised on the issue (BUG-355), and has been imported by LL as a MAINT issue (MAINT-1682). However, there has been no feedback from LL as to the underlying cause, although investigations are continuing.

SL project news: week 42 / 3 – server news

There is a lot going on server-wise at the moment, so best to break it down by heading.

Server Rebalancing

The Lab is currently engaged in a rebalancing exercise in an attempt to put neighbouring regions on the same server and generally do a logical organizing of the grid to help improve various aspects of performance. Speaking at the Server Beta User Group on Thursday 18th October, Oskar Linden explained that there has been a lot of moving around (in terms of regions) within and between the Lab’s co-location facilities, and so the re-balancing is warranted and needed.

This work takes time – a rebalancing operation earlier this year took around 6 weeks to complete. It requires that regions are organized into groups and then generally moved twice: the first time to a temporary server, the second to the target machine, each move requiring a reboot, so people may notice additional and unexpected restarts to regions they are in as the work progresses. Two moves are required because the server topology is so tight that it often isn’t possible to move regions directly from one server to a target server, so an intermediary is required.

While this work will take time to complete, the result should be improvements in stability and performance with the likes of teleports, etc, and even improved region crossings.

More on the Server Deployments for Week 42

  • Main channel: Oskar provided some more information on the main channel update of Tuesday 16th October, saying, “The main channel got a tweak this week, but it was a really small change, and no sim code got changed. We recognised that we had some inefficient SQL queries where large groups were concerned, so we optimised them, and the effect was quite noticeable. The databases are more responsive [and] this helps at all levels.” He went on to clarify that these changes were not Baker Linden’s Group Services code changes, after some in the meeting appeared to think this might have been the case
  • BlueSteel received the updates which were tested in the network pile-on test in week 42. At the time, I commented that teleporting seemed a lot faster, but that might have been a placebo effect of being on Aditi. It was. Commenting on the test, Oskar said, “There were no simulator changes in that test code. We were just testing backend tweaks.”
  • Magnum received no update per se, as previously reported, but was merged with trunk and then redeployed
  • LeTigre received the biggest update, which included new LSL functions and updates, and most importantly of all, a new version of Havok (see below). Of the LSL functions, Maestro had a warning about the new OBJECT_PATHFINDING_TYPE parameter in the pathfinding command llGetObjectDetails, “We misspelled a constant, OPT_UNKNOWN, so we plan to fix that.” The fix will probably be next week.

Havok

As mentioned above, Havok on LeTigre was updated to version 2012.1. The update enables Havok’s built-in terrain optimisation and should lead to improved performance as a result of the physics shape of the terrain being simplified. Prior to the deployment, there were concerns that it would lead to issues with mesh vehicles trying to cross between regions running different versions of Havok, as has previously been the case.

As reported in part one of this update, these concerns led to Andrew Linden contacting the deployment team in LL to check whether it would be possible to ensure none of the Blake Sea regions remained on LeTigre while two versions of Havok are running on the grid, to help alleviate at least some of the pain people would feel when using mesh vehicles there. This apparently happened, although whether it was before or after the deployment is unclear, as some people did report issues following the roll-out. There was also a little confusion as to what had been swapped where.

At the Server Beta meeting, Oskar gave the impression that all Blake Sea regions were on LeTigre. However, at the Simulator User Group meeting on Friday 19th October, Andrew Linden indicated that records showed none of the Blake Sea regions are running on LeTigre, although they are spread across the other channels. Given that there were (according to Andrew) around six Blake Sea regions running on LeTigre to start with, it would appear to make sense that they have been rotated off to another channel, rather than attempting to rotate all of Blake Sea on to LeTigre.

SL projects update week 42 / 1

Server Updates

The main channel deployment took place as planned on Tuesday 16th October. As previously indicated, this was the code deployed to the BlueSteel RC channel in week 41 (essentially an improved database query that should help with the back-end system load).

The Release Candidate channels are due to be updated on Wednesday 17th October as follows:

  • Magnum – will not receive an update, but will continue to run with the code deployed in week 41, probably in the same configuration
  • BlueSteel – will get code that’s almost the same as the main channel, with some OS-level configuration changes that shouldn’t be visible to anyone
  • LeTigre – will be getting a minor update to the Havok library which is mostly about getting our servers to build under Visual Studio 2010 on Windows and autobuild on Linux.

The LeTigre update will use “slightly newer” versions of the Havok libraries, so concerns were raised at the Server / Sim meeting on Tuesday 16th October as to whether this may lead to a resumption of the problem with mesh vehicles being unable to travel between regions running different versions of Havok. Andrew Linden confirmed this might well be the case for mesh vehicles moving between LeTigre regions and other regions following the deployment.

To help reduce issues when situations like this arise, it was suggested that areas such as the Blake Sea regions are either removed from the RC channels, or placed on the same channel. While this would not solve the problem grid-wide, it would reduce the impact somewhat for people using mesh vehicles in these regions. A query was put to the LL deployment team on this by Andrew Linden, and they agreed to try to make the Blake Sea regions more homogeneous by ensuring they are all on the same channel.

SL Viewer

A further stability test build for the beta viewer was made on Friday October 12th, and reached the download page on Tuesday 16th (3.4.1.265898, release notes) after being cleared by QA. This should be the last stability test release and should see the OK for code merges to resume. Merges and release priorities are still being looked at, and speaking at the Open Dev meeting on Monday 15th October, Oz indicated that there are “a few open source contributions in the pipeline that are in the mix”, as well as the anticipated LL merges such as the Steam code, Monty Linden’s HTTP library updates, Baker Linden’s Group Services project code, Apple OSX Mountain Lion support (including Gatekeeper compatibility), etc.

Kelly Linden reports fixing SVC-7870 (Edit Linked Parts isn’t returning creator/owner), but given the current backlog, it may be a while before this makes it through to a beta / release viewer.

Avatar Baking

The aim of this work (Project Sunshine) is to improve issues around avatar baking and to eliminate bake fail issues. It will primarily focus on moving the emphasis for the baking process from the viewer to a new Texture Compositing server. The viewer will retain some elements involved in avatar baking – the actual baking of the avatar shape (i.e. shape values and IDs) will still take place on the viewer side, for example.

As of Monday 15th October, no major news. Commenting at the Content Creation / Mesh Import meeting, Nyx Linden said, “Still plugging along at it :). It’s a complex project with many moving pieces, we’ll let you know when there are updates, and I will definitely be asking for beta testers here when we’re ready for feedback”.

Interest Lists and Object Caching

The focus of this project is to optimise the data being sent to the viewer, information already cached on the viewer and the manner in which that data is used in order to ensure it is used more efficiently so that things rez both faster and in a more orderly manner than is currently the case.

Interest lists and object rezzing: ironing-out the bugs, wherever they are

Andrew Linden continues to iron-out the bugs in the interest lists project, including one in the main viewer codebase wherein after crossing a region boundary the connection to the region you were just in will get reset after about 60 seconds. This is impacting the interest lists work and requires resolving, so Andrew is currently focused on trying to sort it out. A problem has also been reported with objects rezzing in the test regions on Aditi (e.g. Ahern) when moving through them in a vehicle, and will be looked into.

Pathfinding

A question was raised at the Content Creation / Mesh Import meeting on the 15th October as to why a 1-prim pathfinding character has a land impact of 15. This is due to the increased physics load on the character. As previously covered, while this may seem harsh, it actually means that characters with a much higher prim count will also have a land impact of 15 (for example, a 30-prim character will still only have a land impact of 15), unless other factors (such as streaming cost) come into effect.
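As a rough illustration of the arithmetic, the sketch below assumes that a pathfinding character’s land impact is the larger of a fixed physics figure of 15 and its streaming cost. The function name, its parameters and the “take the larger value” rule are assumptions made for this example; it is not LL’s actual accounting code.

```python
CHARACTER_PHYSICS_COST = 15  # figure quoted in the text for pathfinding characters


def character_land_impact(prim_count, streaming_cost=0.0):
    """Hypothetical sketch: land impact of a pathfinding character.

    prim_count is accepted only to show that it does not change the result
    in this simplified model; streaming_cost stands in for the "other
    factors" that can push the figure above 15.
    """
    return max(CHARACTER_PHYSICS_COST, round(streaming_cost))


one_prim = character_land_impact(prim_count=1)
thirty_prim = character_land_impact(prim_count=30)
heavy_mesh = character_land_impact(prim_count=30, streaming_cost=22.4)
print(one_prim, thirty_prim, heavy_mesh)
```

Under these assumptions the 1-prim and 30-prim characters come out the same, and only a streaming cost above 15 changes the result.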

There are a couple of other issues with pathfinding characters which are being (or are about to be) looked at:

  • A bug whereby copies of single-prim characters only have a land impact of one (not 15). This problem is being addressed under PATHBUG-194.
  • A problem whereby pathfinding characters suddenly appear to “fly away” when adjusting your camera position, almost as if they are suffering from lag, and then reappear where they should actually be (I gather this tends to happen when looking at a pathfinding character which is following a set path, then turning the camera away and then back again). Andrew Linden believes the problem is related to interest list updates, and will be looking into it.

Mesh

The patch to enhance the mesh uploader when dealing with rigged mesh items was discussed at the Content Creation / Mesh Import group meeting on October 15th, with Nyx expressing interest in the idea, and agreeing with a suggestion that the patch needs to be formally submitted to LL’s Bitbucket repo, applied to a cloned version of the development viewer, and supported by a JIRA outlining the patch and with a link to the repro.

Mesh upload enhancement: suggested that it is submitted as a patch to LL

SH-3055 is a bug relating to mesh uploads which has been around for a while, but which appears to be affecting more people of late. With it, mesh uploads fail without any error message or warning on clicking CALCULATE or UPLOAD on the mesh upload floater. The issue is hard to track down (or even reproduce) as it doesn’t occur with any consistency. Either the upload works, or it simply sits as if waiting for something – whether it is waiting for data to be returned by the server, or whether it is receiving information and failing to action upon it.

Darien Caldwell and Nicky Dasmijn have been working with a debug viewer in an attempt to pin the problem down, but so far without success. One school of thought they are pursuing is that it is a problem with the viewer’s cURL wrapper (which is also thought to have been responsible for the recent crash issues being experienced in the beta viewer). The thinking behind this is that the problem appeared to come about with the introduction of a multi-threaded cURL in v3.2.5 of the viewer – with 3.2.4 having exhibited no major issues with uploading. Nyx Linden has stated he’ll take the problem to the team working on cURL to see if they can identify anything.

Materials Processing

No further updates. When talking to Geenz Spad and Oz Linden on Tuesday 16th October, Geenz could only say, “There’s not much to really report on materials for the time being unfortunately, but when there is something I’ll be more than happy to tell everyone.” Oz then added, “We’ll do more than tell you – we’ll give you something to play with :-)”.

Network Pile-on Test Update

Commenting on the thread for the pile-on test, Oskar Linden said: “All of the tests passed and the code will be going to RC next week. Thank you all for your help!”

With thanks to Baz deSantis for information on the Sim / server Group meeting.

Piling it on: the network optimisation tests

Thursday October 11th saw a huge response to Oskar Linden’s request for assistance with network optimisation tests, with many people logging-in to Aditi to join his Beta User Group meeting (I actually made it for the first time myself; the time of his meeting is generally a little awkward for me). More were available on the IRC channel established for the test as well.

Oskar’s meeting place at Morris on Aditi.

Things got off to a rocky start; mid-way through the Beta UG meeting everyone received the royal order of the boot, and problems occurred attempting to log back in. It transpired that an SSL certificate had expired at LL’s end and had to be renewed (through until 2015). Even so, not everyone appeared to make it back (or at least, not with their primary accounts!). Maestro Linden did make it back with the rest of us, and immediately sought protection in a state-of-the-art anti-crash system Ordinal Malaprop* created (or is that crated?). No amount of coaxing could get him out, either:

Darien Caldwell: you can come out of the box now Maestro. Crash is over ;p

Mæstro Linden (maestro.linden): I’ll come out when I feel safe 🙂

People getting back after the crash and finding we have a Maestro Linden-in-the-box

The meeting also had some disruption from an unhappy camper or two complaining about bans. One of them made it back following the initial forced log-out, and as final preparations were made for the test, appeared to successfully crash the region. Shortly before this happened, concerns were raised that this individual may have been trying to disrupt the IRC test channel, as they appeared to be passing commands aimed at IRC in local chat (at one point a little later, a similar command appeared in the IRC channel, and I and a number of others were, coincidentally or otherwise, disconnected).

The testing itself proceeded pretty much as planned, with everyone logging-in to a specified region at more-or-less the same time, testing the network capabilities in handling a large number of log-in updates in a single region. From my perspective, this went well, and as one of the initial people to log-in, I didn’t appear to suffer from the kind of lag usually associated with moving around in a region where there are a large number of people arriving.

Following the en-masse arrival, we dispersed to two regions for a group chat load test. I cannot actually say how this went, as I arrived at my designated region, only to take three steps and crash (an issue at my end of the SL equation rather than anything else).

I made it back to participate in the IM tests, which comprised piling-on a mass of IMs to targeted avatars and then awaiting their reply. I think I was one of the first to IM and one of the last to get a reply. Again, not through any failure in the system, but simply because Pey’s Law affected the tester I was IMing – he replied on receiving my first message, but forgot to press ENTER to send :).

The final part of the test was a mass teleport to a specified region, again presumably to test how the network handled a large number of arrivals within a region. While this may have been a placebo effect from being on Aditi, the teleport itself seemed to me to be somewhat faster than is usual, with the progress bar merely flicking up on the screen and then vanishing as I arrived. Once there, I also found walking around with people teleporting-in did not seem to be as prone to mini-freezes or stutters as can be the case. However, the load on the target region (selected at the last-minute due to problems with the intended destination) may have been lighter than hoped, as it had a cap of 21 on the number of avatars allowed into it, and a number of people did report they were unable to teleport-in as a result (there were probably around 40-50 listed on the IRC chat page).

Pile-on Test Medal

Overall the tests made for a fun social gathering, with a lot of good humour all around, and Oskar and his team apparently gathering the data they wanted.

Hopefully, there will be further follow-up on the overall intent of the tests and the results in an upcoming Sim / Server UG meeting. Oskar certainly appeared pleased with the outcome, and was on the main grid after the tests to hand out medals to the participants (providing they knew the sekrit password! 🙂 ).

 * With thanks to DD Ra for pointing this out; I missed checking the creator details earlier.

SL projects update week 41 / 2

This item is a follow-on from part one, published earlier this week.

More Server News

At the Thursday Beta Grid User Group meeting (Thursday October 11th), and prior to the network optimisation tests, Oskar gave further news on the server deploys for week 41. On Tuesday 9th October, the main channel received code previously on BlueSteel, which in keeping with Simon Linden’s comments at the Monday Sim / Server UG meeting, Oskar referred to as being, “A pretty small release, just some server crash mode fixes; stability ++.”

On Wednesday October 10th, BlueSteel and LeTigre received a fix to some database queries that were really slow when accessing really large groups (note these were not Baker Linden’s Group Services code changes; those are being looked at as a deployment in week 42).

Monday 16th October may see some restarts on the grid in order to shuffle some regions onto new hardware. The new servers have more and faster CPU cores, which will increase the number of simulators running on them, although those simulators will be running on faster CPU cores.

Interest Lists and Object Caching

The short-version update for this comes from Andrew Linden, speaking at the Server Group meeting on Friday 12th October, “I thought I’d have something working this week… it isn’t quite working right. You can see it not working on Ahern on Aditi…” (!)

Interest list changes: easing the pain of random rezzing

He went on more seriously to explain that while the new code is working correctly for the most part, and that rezzing orders should be improved / faster, there are some problems with objects which should be in view of an avatar not showing up and a major issue around teleporting into high ground.

When the latter happens, you effectively arrive “underground” (presumably at the default “ground level” for the simulator – 21 metres in the case of unterraformed land). The simulator then calculates where you should be and moves your avatar appropriately. With the new code, this has the effect of breaking the server’s notion of the camera – where it is and what it can see – which is used to figure out what objects to send to the viewer. This means that the camera itself cannot be moved or updated.

There have been some performance tests on an older version of the code, the results of which have been mixed, as Andrew also explained, “There were two performance tests run on an earlier version. One test (mostly empty region with about 30 avatars running around) showed a slight decrease in performance… about 5% worse. Another test (crowd of avatars NOT looking at a pile of dynamic objects behind them) showed about 40% improvement (less time spent running the interest list). So I went back to the code to try to fix the first test, and I think I’ve got something that will be as good or better all around.”

The code will also see changes as to how the camera behaves and in the resultant level of detail. Andrew is currently working on limiting the distance the camera can be moved away from the avatar. Note this is not limiting Draw Distance, but limiting the distance the camera can be freely moved independently of the avatar. He’s considering 128 metres to be the likely range. There are two reasons for this. One is to prevent the camera wandering into regions which are more than one neighbouring region away, the other is because as the camera moves laterally, detail levels degrade, because object detail is tied to the avatar’s position (hence why, when you zoom a great distance, buildings and objects may only appear to partially rez, etc.). Under the new system, object detail will be tied to the camera, so that little degradation is experienced. However, in order for this to work, the camera must be kept within a reasonable distance of the avatar; if it is moved too far, the detail will start to degrade once more (presumably because of the volume of data the viewer is trying to handle).
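The distance limit amounts to clamping the camera’s offset from the avatar. The following is a minimal sketch under stated assumptions: the 128-metre figure is the range Andrew is reported to be considering, while the function name, the tuple-based positions and the choice to pull the camera back along the same direction are illustrative only, and not details of LL’s implementation.

```python
import math

MAX_CAMERA_DISTANCE = 128.0  # metres; the range reported as "likely"


def clamp_camera(avatar_pos, camera_pos, max_distance=MAX_CAMERA_DISTANCE):
    """Return camera_pos, pulled back towards avatar_pos if it is too far away."""
    offset = [c - a for a, c in zip(avatar_pos, camera_pos)]
    distance = math.sqrt(sum(d * d for d in offset))
    if distance <= max_distance:
        return tuple(camera_pos)          # within range: leave unchanged
    scale = max_distance / distance       # shrink the offset to max_distance
    return tuple(a + d * scale for a, d in zip(avatar_pos, offset))


avatar = (100.0, 100.0, 25.0)
near = clamp_camera(avatar, (150.0, 100.0, 25.0))   # 50 m away: unchanged
far = clamp_camera(avatar, (400.0, 100.0, 25.0))    # 300 m away: pulled in
print(near, far)
```

Note that this only limits where the camera can sit relative to the avatar; it says nothing about Draw Distance, which the text makes clear is a separate setting.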

Mesh Deformer

On Thursday 11th October at the Open/Dev meeting, Darien Caldwell outlined her ideas for using base shape info exported from Second Life when uploading rigged meshes.

If this works, it will essentially mean that rather than being restricted to using a default base female or male shape when uploading rigged meshes, creators will be able to download a human shape as an XML file (permissions allowing), and then specify this shape when uploading rigged meshes. The basic code for handling the upload with specific avatar shape information has already been added to the deformer by Qarl Fizz, so Darien is focusing on the best way to use it, her work going into a fork of the existing Mesh Deformer project viewer.

Avatar shapes can currently be exported from a viewer via DEVELOP -> AVATAR -> APPEARANCE TO XML (again subject to the permissions system). This saves the avatar shape data as an XML file, which contains the settings from the appearance sliders, and which is automatically saved to your computer (generally to C:\Users\[USERNAME]\AppData\Roaming\SecondLife\user_settings for Windows).

To associate an avatar shape .XML file with a mesh, Darien is proposing a further revision to the mesh uploader floater, and has provided an early mock-up as to how it might look.

New option to associate an avatar shape XML file with a mesh on upload (image courtesy of Darien Caldwell)

More work is required to flesh out this idea, including, as Oz noted at the Open/Dev meeting, making the shape export option more obvious for people to use, which will more than likely see it moved out of the Develop menu, wherein it is currently nested. The .XML file itself is not suitable for use directly in most 3D modelling programmes, so how the exported data might be used with these when creating mesh items remains to be seen. Nevertheless, if successful, Darien’s approach may offer a more fine-tuned solution to developing mesh clothing to a range of shapes.

Other items

Viewer and FMOD

The use of FMOD has been the subject of much discussion within the TPV/Dev meetings of late. FMOD is used within the sound system for the Viewer, and until now, Linden Lab has provided a script which pulls library files from an FMOD repository for use in viewer builds. However, following what appears to have been a clean-up of their archives, FMOD have removed some of the legacy files required for this, as reported in JIRA OPEN-150.

Some viewer developers have already started using FMODex within their builds (e.g. Singularity 1.7.0+), which also addresses issues with sound quality. Other TPVs are looking at possibly integrating this work into their builds.

It currently appears as though Linden Lab themselves are looking to integrate FMODex, as they see this very much as something which needs to be addressed. Speaking at the TPV/Dev meeting on Friday 5th October, Oz Linden stated: “I got around to forwarding the JIRA on that to our engineering manager for Second Life, and he agreed with me that it is something we should definitely do something about. I’m not sure what the time-table on that will be, it’s going to go into the hopper for the next ‘Things we should do something about, what priority are they compared to all the other things we should do something about’ meeting, which happens weekly.” While OpenAL has also been suggested as an alternative, it does seem more likely that FMODex will be adopted, something which was hinted at by Oz when talking at the Open/Dev meeting on Thursday 11th October.

Teleport Timeouts

Baker Linden has been looking into the issue of teleport timeouts, and has managed to pin down one cause as a reproducible bug. He’s not sure as to whether it can be fixed, and is currently investigating further as to why it is happening.

 

Network Optimisation: LL seek assistance

Linden Lab are in the process of changing the manner in which network traffic is handled within Second Life, and require assistance in testing the changes made to date in order to ensure various services are functioning correctly prior to rolling-out the changes to the grid as a whole.

To this end, Oskar Linden posted the following request in the Server forum late on Wednesday 10th October:

Linden Lab has made some changes to the way regions handle network traffic. We need your help to test. This will help us insure that regions are communicating appropriately over the network. Mainly we are concerned with agents entering via direct login, teleport, and region crossing. As well as other functions such as IMs, Voice, and Group chat.

Tomorrow, October 11th, at 4PM Pacific time (right after the server beta user group) we will conduct these tests with as many people as we have. Testing will take place on ADITI and require out of world communication we will be coordinating via IRC. This will require an IRC client connected to EFNET in the channel #sltest. If you don’t know how to do that you have until tomorrow afternoon to figure it out. 🙂

The details on the tests are here:

 – https://wiki.secondlife.com/wiki/Networking_Optimization_Pile_On_Tests

I hope you can come and help us test these new changes.

__Oskar

The tests will comprise three parts:

  • Direct log-in test to a pre-determined location on the beta grid.
  • A group chat test via the Second Life Beta in-world group.
  • A teleport test to a pre-determined location.

Note that Voice is also indicated as one of the services to be tested, but no details on what this will entail have as yet been included in the test notes – please check both the note and Oskar’s forum thread for possible updates on this ahead of testing.

Those wishing to take part in the tests will need to:

  • Be members of the Second Life Beta in-world group
  • Be able to log-in to the Aditi beta grid

The tests will be coordinated on IRC using the EFnet channel #sltest, and those involved in the tests will need to be able to access this channel either via the EFnet website or through an IRC client.

Accessing the #sltest channel on the EFnet website – note you do not require a registered account; you can access the channel using any suitable nickname

The tests are due to commence at 16:00 SLT.

Note that these changes are not related to the region lag issue (sudden and massive lag spikes) reported in the Server forum threads, but rather appear to be part of ongoing network-related work.