SL project updates week 14/2: server, viewer, CDN

The Trace Too (Flickr) – blog post

Server Deployments Week 14 – Recap

As always, please refer to the server deployment thread in the forums for the latest information and updates.

  • There was no deployment to the Main SLS channel on Tuesday, March 31st, due to the inventory issues arising from the week #13 RC deployment – see my update here for details.
  • On Wednesday, April 1st, all three RCs received the same update to the current server maintenance package to fix the issues with Trash failing to purge in non-AIS v3 viewers (see BUG-8877 and my coverage of the recent issues here). Those suffering from inventory fetch failures on RC regions are advised to re-enable HTTP Inventory in their viewers, if disabled (found under the Develop menu).

SL Viewer

Wednesday, April 1st saw the release of the Project BigBird viewer (yes, seriously!), version 3.7.27.300377, which contains the various fixes for attachment issues Vir Linden has been working on. Specific fixes offered are listed as (note the MAINT designations are for the Lab’s internal JIRA, and thus non-viewable):

  • MAINT-4351 HUDs and attachments intermittently and randomly detach after teleports, sometimes reattaching on their own shortly after, sometimes staying detached completely, or showing as “worn on Invalid Attachment Point” while still detached
  • MAINT-4653 [Attachment-RC] When using “Add” or “Attach to” to attach multiple attachments at the same time, some attachments fall off and some get attached to the wrong attachment point
  • MAINT-4917 Attaching multiple objects generates multiple bake requests
  • MAINT-4918 Removing multiple attachments generates redundant detach requests
  • MAINT-4919 Attempting to wear an outfit with more than 40 attachments will fail

UDP Paths: HTTP Inventory, Textures and More

As noted at the top of this report, the week #13 RC deployments have been causing some inventory-related issues, one of which – the Trash purging problem – has been fixed with this week’s RC deployment.

The second issue – failures in inventory fetching following a cache clear on RC regions – has been caused by a combination of the Lab deprecating the UDP message path for inventory updates and users having the HTTP Inventory option in the viewer (found under the Develop menu – CTRL-ALT-Q) disabled (unchecked).

Given this path has been deprecated, it is essential you keep HTTP Inventory enabled (the Lab will be removing the option from the Develop menu in the future to prevent it being unwittingly disabled).

Speaking at the Server Beta Meeting on Thursday, April 2nd, Oz Linden indicated that UDP messaging is “high on the list” for being deprecated in the future, given that textures have now moved to the CDN.

The CDN and Switching Further Services

While discussing the issue of UDP messaging, Oz again reiterated the desire to pivot things like fetching animations and sounds away from UDP and onto HTTP, with the aim of provisioning them through the CDN, further lifting the load the simulators currently carry. However, he caveated this with two important points:

  • While this is something he’d like to see done, and is in the plans for SL’s future, the work hasn’t actually been scheduled yet, much less started; therefore it is not something that will be happening in the short term (or perhaps even the medium term)
  • The Lab is working on a further round of CDN improvements – again, no time scale is available for their implementation – but there won’t be any additions to the data delivered via the CDN until after such improvements have been deployed.

One aspect here is that, in terms of the simulator load and in terms of the vast majority of users, the switch-over of avatar, mesh and texture data to CDN-based services has been a success for the Lab. However, as we’ve also seen, it has resulted in issues for some users, up to and including a degraded service due to the actions of at least one ISP. While the latter is not something the Lab or their CDN provider can directly tackle, it does point to the fact that while off-loading the heavy lifting from the Lab’s servers can make for improvements, it can affect users in other ways.

Hence the Lab is being cautious in its approach, and is continuing to work with its CDN providers to improve the service as far as can be done, in the hope of reducing the number of ways in which users might find SL a poorer experience as a result of the CDN implementation. However, exactly what can be achieved and which issues can be mitigated remains to be seen.

In the meantime, as per part 1 of this week’s update, if you do feel mesh and texture rendering isn’t what it once was, try following Monty Linden’s interim ideas for easing things.

SL project updates week 10/1: server, general news

Leka, Nordan om Jorden (Flickr) – blog post

Server Deployments

Tuesday, March 3rd, saw the Main (SLS) channel receive the server maintenance package deployed to the RC channels in week #9. This includes:

  • A server-side fix for BUG-8297, “Unable to teleport anywhere using SLGO”
  • Improvements to server logging.

There were no scheduled deployments to the RC channels on Wednesday, March 4th.

Group Chat

Following the last deployment of back-end group chat changes during week #9, some large groups with active group chat have reported an increase in message failures, although these appear to occur somewhat randomly, with some people seeing the messages and others simply not receiving them at all.

Commenting on the problem at the Simulator User Group meeting on Tuesday, March 3rd, Simon Linden summarised the situation thus:

In short, yes, it’s cranky, and yes, we’re (as in I am) looking at it … the chat server itself is actually running better than before, believe it or not. A back-end service it relies on, what we call “agent presence” [used to help locate someone on the grid], seems to be having new problems, so the changes may have added load to those servers and is causing problems, or something else unexpectedly changed … [So] some people don’t get the messages when chat is failing … it’s dropping sending some updates and messages when it times out with some other internal requests.

Further updates will be provided as the Lab / Simon continues to look at the problems.

CDN Notes

There have been recent reports of people experiencing slow texture and mesh load issues, leading to questions concerning the CDN service (although some of the issues that have been mentioned might be related to local caching more than the CDN). In particular, questions have been asked as to how long a CDN server retains its cache of data relating to regions prior to going “cold” and requiring a “reload” from the SL services. Commenting on this at the Open-source Developer meeting on Monday, March 2nd, Oz Linden said that some CDN caches do age out more quickly than others.

The Lab has also been experimenting with more than one CDN provider, and is continuing to try different CDN configurations to further tune things, while also continuing to measure results; so we may yet see further changes / improvements, and a possible decrease in instances that may be related to “cold” CDN loads.

Other Items

Rigged Mesh Crashers

The Server Beta Meeting on Thursday, February 26th saw discussion of a “new” mesh crasher being used on the grid. This is essentially a deliberately corrupted rigged mesh attachment which, when worn, will cause viewers around it to immediately crash, with no warning or ability to take preventative action, such as muting the offending avatar.

Just over a year ago, some advice was given on how to counter graphics crashers by adjusting the viewer’s debug settings, and some people may be getting pointed towards it again in order to avoid being affected by the “new” crasher.

However, changing the specified debug settings can lead to a failure to render much of what you actually want to see, as noted in this comment following the article. At the time the advice was given, the Firestorm team tracked many of the problems their users were experiencing directly to the settings having been changed. Ergo, if you are pointed to this particular article as a means of combating graphics crashers, please keep in mind you may get undesirable results, and keep a note of the original settings so you can switch back to them should this be the case.

During the discussion on this matter at the SBUG meeting, speculation was raised on whether or not the forthcoming new viewer rendering controls (see STORM-2082) might offer protection against such attacks. Opinion is divided, as the viewer downloads the data which may cause a graphics crash and starts processing some of it in order to determine what to render or not, and even this initial processing could be enough to crash it.

SL Feed Issues

There has been an uptick in the number of snapshot uploads to the SL feeds failing over the course of the last week, with some additionally reporting issues of comments failing to appear / “loves” failing to stick. Some users also reported issues over the weekend with web profiles failing to load, and a JIRA (see BUG-8677) was logged on this issue on March 3rd.

The last several days have seen people again encounter issues with snapshots failing to process / display in their feeds

Whether the two issues had a common cause isn’t clear, but as the latter has been resolved, if you are one of those continuing to experience snapshot upload failures, please file a JIRA providing as much information as possible (links to any feed post with a missing snapshot, date / time of upload, number of failures, etc.).

SL project updates week 51/2: SBUG, TPV Developer meeting

Frisland, Laluna Island (Flickr) – blog post

The following notes are taken from the Server Beta User Group (SBUG) meeting held on Thursday, December 18th, 2014, and the TPV Developer meeting held on Friday, December 19th. A video of the latter is included at the end of the article, my thanks as always to North for recording it and providing it for embedding.

With reference to the meeting video, summary notes are provided below with time stamps to assist in spotting and listening to the associated conversations.

Server Deployments Week 51 – Recap

  • On Tuesday, December 16th, the Main (SLS) channel was updated with the server maintenance package deployed to the three RC channels in week #50
  • There were no deployments to the RC channels.

The end-of-year code freeze / no change window comes into effect from the end of the week, which means there will be no further server updates until January 2015.

SL Viewer

Release Viewer

The Maintenance RC viewer, version 3.7.23.297296, was promoted to the de facto release viewer on Thursday, December 18th. This viewer comprises a solid collection of bug fixes and improvements to many areas of SL, and also includes a range of fixes to previously released changes in the way joint offsets in rigged meshes are handled. Please refer to the release notes for further information.

Experience Keys RC Viewer

On Wednesday, December 17th, the Experience Keys / Tools viewer was updated to release candidate status with the release of version 3.7.23.297364. Please refer to my overview of the viewer (written while it was at project viewer status) for further information.

Further RC Updates

[00:50] As a result of the promotion of the Maintenance RC, both the new Experience Keys RC viewer and the HTTP Pipelining RC viewer are currently being rebuilt to include the Maintenance release code. These updates may appear in the release viewer pipeline on Monday, December 22nd, or they may be held back from release until after the end of the no change window.

Viewer Build Tools Project

[01:41] The new year should also see the first release of a project viewer for Mac and Windows built using the new build tools chain and autobuild process.

Group Chat

The last of the 2014 updates are being deployed to the back-end servers. At the time of the Server Beta User Group meeting, there were just a “few more” hosts that had yet to receive the updates, so things should be completed in short order. These improvements are focused on improving the overall robustness of the service and dealing with overload conditions.

CDN Work

What is being referred to as a “mini CDN” test was carried out on the BlueSteel region on the morning (PDT) of Thursday, December 18th. The test was designed to check a more flexible CDN configuration that is going to make it easier for the Lab to deal with failovers. “It should be invisible normally but lets us have better control of where the viewer gets those mesh and texture assets,” Simon Linden said of the work, which will likely see a formal deployment in the New Year.

Viewer-managed Marketplace (VMM)

[03:53] There was an in-world meeting held on Friday, December 12th to discuss the Viewer-managed Marketplace (notes and transcript).

There should be a summary post from the Lab, covering JIRAs raised on VMM and comments made on the forums, which should be appearing on the current forum thread around the time this update is published. A further feedback meeting is being planned for the New Year.


Monty Linden discusses CDN and HTTP

Monty Linden talking CDN and HTTP

In show #46 of The Drax Files Radio Hour, which I’ve reviewed here, Draxtor pays a visit to the Lab’s head office in Battery Street, San Francisco. While there, he interviews a number of Linden staffers – including Monty Linden.

Monty is the man behind the Herculean efforts in expanding and improving the Lab’s use of HTTP in support of delivering SL to users, which most recently resulted in the arrival of the HTTP Pipelining viewer (the code for which is currently being updated).

He’s also been bringing us much of the news about the content delivery network (CDN) project, through his blog posts; as such, he’s perhaps the perfect person to provide further insight into the ins and outs of the Lab’s use of both the CDN and HTTP in non-technical terms.

While most of us have a broad understanding of the CDN (which is now in use across the entire grid), Monty provides some great insights and explanations, so I thought it worthwhile pulling his conversation with Drax out of the podcast and devoting a blog post to it.


Monty Linden talks CDN and HTTP with Draxtor Despres on the Drax Files Radio Hour

Monty starts out by providing a nice, non-technical summary of the CDN (which, as I’ve previously noted, is a third-party service operated by Highwinds). In paraphrase, the idea is to get essential data about the content in any region as close as possible to SL users by replicating it in as many different locations around the world as possible; then, by assorted network trickery, ensure that data can be delivered to users’ viewers from the location that is closest to them, rather than having to come all the way from the Lab’s servers. All of which should result in much better SL performance.

“Performance” in this case isn’t just a case of how fast data can be downloaded to the viewer when it is needed. As Monty explains, in the past, simulation data, asset management data, and a lot of other essential information ran through the simulator host servers. All of that adds up to a lot of information the simulator host had to deliver to every user connected to a region.

The CDN means that a lot of that data is now pivoted away from the simulator host, as it is now supplied by the CDN’s servers. This frees up capacity on the simulator host for handling other tasks (an example being that of region crossings), leading to additional performance improvements across the grid.

Highwinds, a CDN provider Linden Lab initially selected for this project, has 25 data centres around the world and a dedicated network from and through which essential asset data on avatar bakes, textures and meshes (at present) can be delivered to SL users

An important point to grasp with the CDN is that it is used for what the Lab refers to as “hot” data. That is, the data required to render the world around you and other users. “Cold” data, such as the contents of your inventory, isn’t handled by the CDN. There’s no need, given it is inside your inventory and not visible to you or anyone else (although objects you rez and leave visible on your parcel or region for anyone to see will have “hot” data (e.g. texture data) associated with them, which will gradually be replicated to the CDN as people see them).

The way the system works is that when you log in or teleport to a region, the viewer makes an initial request for information on the region from the simulator itself. This is referred to as the scene description information, which allows the viewer to know what’s in the region and start basic rendering.

This information also allows the viewer to request the actual detailed data on the textures and meshes in the region, and it is this data which is now obtained directly from the CDN. If the information isn’t already stored by the CDN server, it makes a request for the information from the Lab’s asset servers, and it becomes “hot” data stored by the CDN. Thus, what is actually stored on the CDN servers is defined entirely by users as they travel around the grid.
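For those who like to think in code, this flow boils down to a simple edge-cache pattern. The sketch below is purely illustrative – the class names, asset IDs and responses are invented for the example, and the real services are of course far more involved – but it captures the behaviour described above: scene description from the simulator, texture and mesh data from the CDN edge node, with a fall-back to the Lab’s origin servers (and the data turning “hot”) on a cache miss.

```python
# Illustrative sketch only: all names and data here are hypothetical,
# not the Lab's actual APIs. It models the fetch flow described above.

class Origin:
    """Stand-in for the Lab's asset servers."""
    def fetch(self, asset_id):
        return f"<data for {asset_id}>"

class CdnEdge:
    """Stand-in for a CDN edge node near the user."""
    def __init__(self, origin):
        self.origin = origin
        self.cache = {}  # this node's store of "hot" data

    def get_asset(self, asset_id):
        if asset_id not in self.cache:
            # Cache miss: the first visitor to request this asset pays
            # the round trip to the origin; the result is then kept as
            # "hot" data, so later requests are served locally. This is
            # why a "cold" region loads more slowly for its first visitor.
            self.cache[asset_id] = self.origin.fetch(asset_id)
        return self.cache[asset_id]

class Simulator:
    """Stand-in for the region's simulator host."""
    def get_scene_description(self):
        return {"asset_ids": ["texture-1", "mesh-7"]}

# The viewer still gets the scene description from the simulator, but
# pulls the heavy texture/mesh payloads from the CDN edge.
edge = CdnEdge(Origin())
scene = Simulator().get_scene_description()
assets = [edge.get_asset(a) for a in scene["asset_ids"]]
print(assets)
```

The sketch also makes Monty’s closing point concrete: what ends up stored on the CDN is driven entirely by where users actually go.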

The CDN is used to deliver “hot” texture and mesh data – the data relating to in-world objects – to the viewer on request

The HTTP work itself is entirely separate to the CDN work (the latter was introduced by the Lab’s systems engineering group, while Monty, as noted in my HTTP updates, has been working on HTTP for almost two-and-a-half years now). However, they are complementary; the HTTP work was initially aimed both at making communications between the viewer and the simulator hosts a lot more reliable, and at pivoting some of the data delivery between simulator and viewer away from the more rate-limited UDP protocol.

As Monty admits in the second half of the interview, there have been some teething problems, particularly when using the CDN alongside his own HTTP updates in the viewer. This is being worked on, and some recent updates to the viewer code have just made it into a release candidate viewer. In discussing these, Monty is confident they will yield positive benefits, noting that in tests with users in the UK, the results were so good, “were I to take those users and put them in our data centre in Phoenix and let them plug into the rack where their simulator host was running, the numbers would not be better.”

So fingers crossed on this as the code sees wider use!

In terms of future improvements / updates, as Monty notes, the CDN is a major milestone, something many in the Lab have wanted to implement for a long while, so the aim for the moment is making sure that everyone is getting the fullest possible benefit from it. In the future, as Oz Linden has indicated in various User Group meetings, it is likely that further asset-related data will be moved across to the CDN where it makes sense for the Lab to do this.

This is a great conversation, and if use of the CDN has been confusing you at all, I thoroughly recommend it; Monty does a superb job of explaining things in clear, non-technical terms.

Lab issues further CDN update – more improvements coming

On November 1st, 2014, the Lab blogged about improvements seen from their side of things as a result of the CDN support deployment. At the time the updates were being issued, the Lab was also asking for feedback from users as to how things were going for them.

As a result of this request for feedback, the Lab issued a further update on the improvements on Friday, November 7th, and it is a tale of two halves.

The first part of the blog post re-states the core benefits that have been seen as a result of the CDN deployment for mesh and texture data, which is again split into two key areas: a considerable reduction in the load on some key systems on the simulator hosts, and a big performance improvement in texture and mesh data loading, resulting in users seeing faster rez times in new areas they’re visiting.

From: An update on the CDN project, Linden Lab, November 7th, 2014

However, the experience of some users hasn’t been so good, as reported in the forum thread, and it could not be put down to matters of distance from the CDN nodes vs. the Lab’s simulators, or to people experiencing slower load times as a result of being the very first to enter a region which had not been cached at the local CDN node.

This feedback encouraged the Lab into further investigation and data-gathering of specific situations, allowing them to engage with CDN supplier Highwinds in order to try to determine possible reasons for the poorer experiences. The second part of the blog post notes the outcome of these efforts:

We believe that the problems are the result of a combination of the considerable additional load we added to the CDN, and a coincidental additional large load on the CDN from another source. Exacerbating matters, flaws in both our viewer code and the CDN caused recovery from these load spikes to be much slower than it should have been. We are working with our CDN provider to increase capacity and to configure the CDN so that Second Life data availability will not be as affected by outside load. We are also making changes to our code and in the CDN to make recovery quicker and more robust.

The blog post also points out some of the risks involved when trying to deploy large-scale changes to a complex and dynamic environment such as Second Life:

Making any change to a system at the scale of Second Life has some element of unavoidable risk; no matter how carefully we simulate and test in advance, once you deploy at scale in live systems there’s always something to be learned. This change has had some problems for a small percentage of users; unfortunately, for those users the problems were quite serious for at least part of the time.

The post concludes by thanking all those who took the time to provide data on their particular circumstances, helping the Lab understand the nature of the problems being experienced and assisting with further investigations. It closes with a note that it is hoped the changes to be made as a result of this work will reduce such problems, allowing more people to enjoy the benefits offered through the use of the CDN for asset data delivery.

CDN – Lab issues data on improvements

On top of their feature blog post on recent improvements to SL, on which I also blogged, the Lab has issued a Tools and Technology update with data on the initial deployment of the CDN.

Entitled CDN Unleashed, the post specifically examines the percentage of simulator servers experiencing high load conditions (and therefore potentially a drop in performance) on the (presumably) BlueSteel RC both before and after deployment of the CDN service to that channel – and the difference even caught the Lab off-guard.

Charting servers on a production release candidate channel with high HTTP load conditions before and after we rolled the CDN code onto them (image via Linden Lab)

While a drop in load had been expected prior to the deployment, no-one at the Lab had apparently expected it to be so dramatic that the high load conditions would almost vanish. Such were the figures that, as the blog post notes, at first those looking at them thought there was something wrong, spending two days investigating and checking and trying to figure out where the error in the data came from – only it wasn’t an error; the loads really had been dramatically reduced.

Elsewhere, the blog post notes:

Second Life was originally designed for nearly all data and Viewer interactions to go through the Simulator server. That is, the Viewer would talk almost exclusively to the specific server hosting the region the Resident was in. This architecture had the advantage of giving a single point of control for any session. It also had the disadvantage of making it difficult to address region resource problems or otherwise scale out busy areas.

Over the years we’ve implemented techniques to get around these problems, but one pain point proved difficult to fix: asset delivery, specifically textures and meshes. Recently we implemented the ability to move texture and mesh traffic off the simulator server onto a Content Delivery Network (CDN), dramatically improving download times for Residents while significantly reducing the load on busy servers.

Download times for textures and meshes have been reduced by more than 50% on average, but outside of North America the improvements are even more dramatic.

Quite how dramatic for those outside North America isn’t clear, quite possibly because the Lab is still gathering data and monitoring things. However, the post does go on to note that in combination with the HTTP pipelining updates now available in the current release viewer (version 3.7.19.295700 at the time of writing), the CDN deployment is leading to as much as an 80% reduction in download times for mesh and texture data. Hence the Lab is keen to see TPVs adopt the HTTP code as soon as their release cycles permit, so that their users can enjoy the additional boost provided by the code on top of the benefits offered by the CDN.
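For the curious, the sketch below gives a rough idea of what “pipelining” means at the wire level (Python; the host and paths are placeholders, not actual SL or CDN endpoints). Rather than waiting for each response before sending the next request, the client writes several HTTP/1.1 GETs back-to-back on a single connection, so the requests overlap in flight instead of each paying a full round trip.

```python
# Rough illustration of HTTP/1.1 pipelining; example.com is a
# placeholder host, not an actual SL or CDN endpoint.
import socket

HOST = "example.com"
PATHS = ["/", "/", "/"]

with socket.create_connection((HOST, 80)) as sock:
    # Send all requests up front instead of one per round trip.
    for path in PATHS:
        sock.sendall(f"GET {path} HTTP/1.1\r\nHost: {HOST}\r\n\r\n".encode())

    # The responses then come back, in order, on the same connection.
    sock.settimeout(5)
    data = b""
    try:
        while chunk := sock.recv(4096):
            data += chunk
    except socket.timeout:
        pass  # keep-alive connection went quiet; we have our responses

print(data.count(b"HTTP/1.1 200"), "responses received")
```

How much this technique helps depends heavily on the latency between user and server, which is consistent with the Lab’s observation that the biggest improvements are being seen outside North America.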

Again, at the time of writing, the following TPVs already have the HTTP pipelining code updates:

As per the Performance, Performance, Performance blog post, the Lab want to hear back from users on the improvements. Comments can be left on the Performance Improvements forum thread, where Ebbe and Oz have been responding to questions and misconceptions, and Whirly Fizzle has been providing valuable additional information.