Second Life’s August 23rd woes explained by April Linden

Tuesday, August 23rd was not a particularly good day for Second Life, with an extended period of unscheduled maintenance during which log-ins were suspended and those in-world were advised to refrain from rezzing No Copy objects, making any LindeX-related transactions, etc.

At the time of the problem, there was speculation that it might be due to further issues with the central database node (and kudos to Caitlyn for suggesting this 🙂 ). Writing in a Tools and Technology blog post on August 24th, Operations Team lead April Linden confirmed this was in fact the case:

Shortly after 10:30am [PDT], the master node of one of the central databases crashed. This is the same type of crash we’ve experienced before, and we handled it in the same way. We shut down a lot of services (including logins) so we could bring services back up in an orderly manner, and then promptly selected a new master and promoted it up the chain. This took roughly an hour, as it usually does.

Given this has happened in the relatively recent past (see here and here), the Ops Team are getting pretty good at handling these situations. Except this time there was a slight wrinkle in the proceedings. The previous failures had occurred at times when concurrency was relatively low. This time, however, the problem hit when rather a lot of people were trying to get into SL, so as April notes:

A few minutes before 11:30am [PDT] we started the process of restoring all services to the Grid. When we enabled logins, we did it in our usual method – turning on about half of the servers at once. Normally this works out as a throttle pretty well, but in this case, we were well into a very busy part of the day. Demand to login was very high, and the number of Residents trying to log in at once was more than the new master database node could handle.

Around noon we made the call to close off logins again and allow the system to cool off. While we were waiting for things to settle down we did some digging to try to figure out what was unique about this failure, and what we’ll need to do to prevent it next time.

It wasn’t actually until a third attempt was made to bring up the login hosts one at a time that things ran smoothly, with services being fully restored at around 2:30pm PDT.

Now, as April notes, she and her team have a new challenge to deal with: understanding why they had to turn the login servers back on much more slowly than in the past. There is, however, a bright spot in all this: the work put into making the Grid Status feed more resilient paid off, with the service appearing to cope with the load placed on it by several thousand people trying to discover what was going on.

None of us like it when things go wrong, but it’s impossible for SL to be all plain sailing. What is always useful is not only being kept informed about what is going on when things do get messed up (and don’t forget, if you’re on Twitter you can also get grid status updates there), but also being given the opportunity to understand why things went wrong after the fact.

In this respect, April’s blog posts are always most welcome, and continue to be an informative read, helping anyone who reads them appreciate just what a complicated beast Second Life is, and how hard the Lab actually does work to try to keep it running smoothly for all of us – and to get on top of things as quickly as they can when they do go wrong.

SL project updates 16 20/1: server, viewer, Aditi

Regency Buckingham – The King’s Gallery (Inara Pey, May 2016, on Flickr) – blog post

Server Deployments Week #20

As always, please refer to the server deployment thread for the latest updates.

  • On Tuesday, May 17th, the Main (SLS) channel was updated with a server maintenance package previously deployed to the RC channel, containing minor internal improvements and a crash fix.
  • On Wednesday, May 18th, all three RC channels should be updated with a new server maintenance package, originally held over from week #19, also described as comprising minor internal improvements with no visible functional changes to Second Life.

SL Viewer

A new RC viewer appeared in the release channel on Monday, May 16th. Version 4.0.5.315019 is the anticipated Inventory Message Viewer. This viewer comprises Aura Linden’s work in removing from the viewer all of the old UDP inventory messaging paths, which have already been replaced by more robust mechanisms (and in some cases have already had their server-side support removed), but which have until now remained a part of the viewer’s code.

A full list of the messages which have been removed can be found in the release notes for the viewer, and it is noted that any messages in the list which still have back-end support will see that support removed in the near future.

This means the current SL viewers which are available comprise:

  • Current Release version: 4.0.4.314579 (dated April 28th, promoted May 5th) – formerly the Maintenance RC viewer
  • Release candidate viewers:
    • Quick Graphics RC viewer, version 4.0.5.315117, dated May 11th – comprises the graphics pre-sets capability and the new Avatar Complexity settings
    • Inventory Message RC viewer, version 4.0.5.315019, as noted above
  • Project viewers:
    • Project Bento viewer, version 5.0.0.314884, dated May 5th, containing several updates related to joint offsets and meshes, and slider changes
    • Oculus Rift project viewer, version 3.7.18.295296, dated October 13th, 2014 – Oculus Rift DK2 support
  • Obsolete platform viewer version 3.7.28.300847, dated May 8, 2015 – provided for users on Windows XP and OS X versions below 10.7.

Project Bento

As a reminder, it is anticipated that server-side support for Project Bento will be enabled on the main (Agni) grid some time during week #21, to allow for more extensive testing of the new avatar skeleton capabilities. Those wishing to try the skeleton extensions and new sliders when rigging mesh models will need to use the Bento project viewer or a third-party viewer with the Bento code.

Note that if you are running a non-Bento viewer and happen across someone testing the Bento capabilities, any mesh they are wearing rigged to the new Bento bones will appear distorted  / broken in your view.

Aditi Grid

Issues continue with Aditi (the beta grid), notably with apparent inventory content losses and even the potential for inventory corruption (see BUG-16714 for details of some of the issues being encountered).

These problems take the form of assets appearing in inventory, but generating a “Missing from database” error when attempting to rez / wear / attach. Some reports suggest the issue is restricted to items added to Aditi inventories following the most recent syncing operations between Agni and Aditi.  Normal corrective actions, such as clearing cache, do not correct matters.

The Lab staff looking after the beta grid have been apprised of the situation, and summed-up their response in a single phrase (and I’m apparently quoting): “bleargh!” – an understandable reaction, given the upsets Aditi caused in week #19. They are, however, digging into the problem.

Of outages and feedback

I normally keep a close eye on outgoing communications from the Lab, but this week I’ve had other things distracting me, and so haven’t been watching the official blog for posts and updates. My thanks therefore to reader BazdeSantis for pointing me to April Linden’s Tools and Technology update, The Story Behind Last Week’s Unexpected Downtime.

April has very much become the voice of the Lab’s Operations team, and has provided us with some excellent insights into Why Things Sometimes Went Wrong – a valuable exercise, as it increases both our understanding of the complexities inherent in Second Life and our awareness of what is likely to be going on behind the scenes when things do go drastically sideways.

April’s post refers to the issues experienced on Friday May 6th, when a primary node of a central database failed, with April noting:

The database node that crashed holds some of the most core data to Second Life, and a whole lot of things stop working when it’s inaccessible, as a lot of Residents saw.

When the primary node in this database is off-line we turn off a bunch of services, so that we can bring the grid back up in a controlled manner by turning them back on one at a time.

There’s an interesting point to note here. This is the same – or a very similar – issue as the one which occurred in January 2016, which again goes to show that, given the constant usage it sees, Second Life is a volatile service – and that the Operations team are capable of turning major issues around in a remarkably short time: around 90 minutes in January, and less than an hour this last time.

Both events were also coupled with unexpected parallel issues: in January, the database issue was followed by issues with one of the Lab’s service providers – which did take a while to sort out. This time it was the Grid Status service. As I’ve recently reported, the Grid Status web pages have recently moved to a new provider. A couple of changes resulting from this have been to the RSS feed and the integration of the Grid Status reporting pages with the rest of the Lab’s blog / forum Lithium service. However, as April also notes:

It can be really hard to tune a system for something like a status blog, because the traffic will go from its normal amount to many, many times that very suddenly. We see we now have some additional tuning we need to do with the status blog now that it’s in its new home.

She also points out that people with Twitter can also track the situation with Second Life by following the Grid Status Twitter account.

April’s posts are always welcome and well worth reading, and this one is no exception. We obviously don’t like things when they go wrong, but it’s impossible for SL to be all plain sailing. So, as I’ve said before (and above), hearing just what goes on behind the scenes to fix things when they do go wrong helps remind and inform us just how hard the Lab actually does work to keep the complexities of a 13-year-old platform running for us to enjoy.


SL project updates 16 13/1: Aditi inventory, invisiprims

[G]aio (Inara Pey, March 2016, on Flickr) – blog post

Server Deployments Week #13

There are no scheduled deployments or restarts planned for the week. The next deployment should occur in week #14 (week commencing Monday, April 4th), when the release candidate channels should receive a server maintenance package containing some as yet unspecified fixes.

SL Viewer

The Project Bento viewer, containing the new avatar skeleton extensions, updated on Tuesday, March 29th to version 5.0.0.313150. The other viewer channels remain unchanged from the end of week #12:

  • Current Release version: 4.0.2.312269, dated March 17th – formerly the Maintenance RC viewer
  • Release candidate cohorts:
    • HTTP updates and Vivox RC viewer, version 4.0.3.312816, dated March 23rd – probably the next viewer in line to be promoted to the de facto release status
    • Quick Graphics RC viewer, version 4.0.2.312297, dated March 11th – possibly to go through a further update (tests were being carried out with  the Avatar Complexity settings in week #12)
  • Project viewers:
    • Oculus Rift project viewer, updated to version 3.7.18.295296 on October 13th, 2014 – Oculus Rift DK2 support (download and release notes)
  • Obsolete platform viewer, version 3.7.28.300847, dated May 8th, 2015 – provided for users on Windows XP and OS X versions below 10.7.

Aditi Inventory Problems

As noted in part #2 of my last project update, there are issues with the new Aditi inventory syncing mechanism.

One issue is that items created on Aditi following one inventory syncing process will disappear from inventory when logging into Aditi following the next inventory syncing run (see BUG-11651).

This is likely the result of the viewer using the same cache regardless of the grid you log in to. The current fix is therefore to clear the viewer cache completely (or to delete the inventory .gz files from your cache folder), and then log back into Aditi.
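For anyone who prefers to remove just the inventory cache files rather than clearing the entire cache, the following is a minimal sketch of that step. The cache path and the file-name pattern are assumptions for illustration (they vary by viewer and platform; the actual cache location is shown in the viewer’s Preferences), and the viewer should be closed before running anything like this.

```python
# Sketch only: remove inventory cache (.gz) files from a viewer cache folder.
# CACHE_DIR and the "*.inv*.gz" pattern are assumptions - check your viewer's
# Preferences (Network & Cache) for the real cache location, and close the
# viewer before deleting anything.
from pathlib import Path

CACHE_DIR = Path.home() / ".secondlife" / "cache"  # hypothetical example path

for gz_file in CACHE_DIR.glob("*.inv*.gz"):
    print(f"Removing {gz_file}")
    gz_file.unlink()
```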

However, this approach in turn causes an issue of its own.

When logging back into Agni (the main grid) after clearing cache as described above, the Aditi assets will appear to be listed in your Agni inventory. However, any attempt to rez or wear or share the assets from Aditi will result in an error message, because the assets themselves are not physically part of your Agni inventory. Again, the solution is to clear cache  / remove the inventory .gz files from your viewer cache and re-log into Agni.

Also noted in the JIRA is that this issue results in some very odd duplication of Calling Cards on Aditi.

The Solution

The desired fix is to have different inventory caches for each grid visited, and as noted in the JIRA report, this is how the Lab intends to proceed.
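As a rough illustration of the per-grid approach (the directory layout and naming below are assumptions for illustration, not the Lab’s actual implementation), the cache location is simply keyed on the grid being logged into, so items cached while on Aditi no longer shadow the Agni inventory:

```python
# Rough illustration of per-grid inventory caching - the layout and naming
# here are assumptions, not the Lab's actual implementation.
from pathlib import Path

def inventory_cache_dir(base_cache: Path, grid_name: str) -> Path:
    """Return a grid-specific cache directory, e.g. .../agni or .../aditi."""
    path = base_cache / grid_name.lower()
    path.mkdir(parents=True, exist_ok=True)
    return path

# Usage: separate caches mean Aditi items never appear in the Agni listing.
agni_cache = inventory_cache_dir(Path("cache"), "Agni")
aditi_cache = inventory_cache_dir(Path("cache"), "Aditi")
```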

Invisiprims

As noted in part #3 of my last project update, there is a new issue with invisiprims, which sees any object, worn or in-world, using the texture UUIDs associated with them rendered as a solid grey or black surface or object, regardless of whether ALM is enabled in the viewer or not. Prior to this issue, which is the result of a change made in the current release viewer (version 4.0.2.312269), invisiprims would either mask whatever was behind them with ALM off, or simply be ignored if the viewer was running with ALM enabled.

The new invisiprim issue is that regardless of whether a viewer is running with ALM disabled (l) or enabled (r), worn or in-world objects using them now appear either solid grey or black (click image for full size, if required)

As having solid grey surfaces and objects appear on avatars and in-world looks very unsightly (remembering that there is a lot of old, No Mod content in-world which makes extensive use of invisiprims and their associated textures), the suggestion has been put forward that the viewer should be modified to simply ignore the invisiprim texture UUIDs, or treat them as “normal” transparent textures, regardless of whether or not ALM is enabled in the viewer. A fix has been submitted to the Lab to achieve this.

Asked during the Simulator User Group meeting on Tuesday, March 29th, if the Lab had reached a decision on adopting the fix, Simon Linden said, “We were talking about it earlier … nobody wants to do anything to break content; so we have the hole-in-the-water use, which is nice for boats and such.”

Oz Linden then added, “We’re going to do some testing of alternatives… so I guess the answer is that we don’t have a final decision yet.”

SL project updates 16 12/3: invisiprims

Invisiprims: with ALM disabled (left) and ALM enabled (right); and as they appear now in the official viewer, with or without ALM enabled (click for full size, if required)

As noted in my week #11 update, the current release of the LL viewer now effectively “breaks” the remaining invisiprim capability in the viewer, with any object or surface using them rendered as either solid grey or black, something which is seen as less than optimal with regards to long-standing in-world content, prompting some debate as to what should be done with invisiprims going forward.

To understand what has been discussed, and what is likely to be done, it is necessary to dip back into some history.

Background

Once upon a time, invisiprims were the means of achieving an alpha mask effect. For example, their use in footwear meant that an avatar’s feet could be masked to prevent them showing through shoes and boots. They could also be used in-world, a typical example being their use to mask Linden Water from being seen inside boat hulls or things like dry docks – one of the most famous examples being the dry dock at Nautilus (shown below).

As it used to be: the Nautilus dry dock uses an invisiprim to mask the Linden Water. For the last several years, this has only worked when the Advanced Lighting Model (ALM) in the viewer is disabled

Invisiprims were able to do this by making use of two unique texture UUIDs within the viewer which, when called, would act as alpha masks. However, this always came at a cost to rendering, and could lead to unpredictable results (e.g. glitches with rendering, odd interactions between the invisiprim textures and other textures, etc.). Because these issues became particularly problematic when using some of the advanced rendering capabilities (what is now called the Advanced Lighting Model, or ALM) in the viewer, a decision was taken a number of years ago to have ALM ignore the alpha masking effect of the invisiprim texture UUIDs.

Thus, anyone running the viewer with ALM enabled for the last several years has not seen the masking effects of invisiprims; avatar body parts show through wearable items which use them, for example (hence the adoption of more efficient alpha layers by clothing and accessory designers). Nor do in-world invisiprims act as masks for things like Linden Water when viewed with ALM active (as illustrated below), although they would still alpha mask if ALM was disabled in the viewer.

Following the changes made a few years ago to the Advanced Lighting Model, the “magic” invisiprim texture UUIDs are ignored during rendering, with the result that they no longer mask things like Linden Water when seen in a viewer with ALM enabled

While this latter point – the lack of ability to hide things like Linden Water from view – may have appeared less than perfect at the time the changes were made, it has over the ensuing years become accepted behaviour when seen in-world.  So what has now changed to once again make invisiprims a subject of discussion?

The New Problem and Its Proposed Solution

In short, a recent change to the viewer rendering system (found in the current release viewer, 4.0.2.312269) means that anything using the invisiprim texture UUIDs is now seen as a solid grey or black surface / object, regardless of whether ALM is enabled in the viewer or not. This has left a lot of long-standing, No Mod in-world content looking distinctly odd and unsightly (shown below, again using the Nautilus dry dock).

A change to the 4.0.2.312269 release viewer means that invisiprims now render as solid grey or black surfaces / objects whether or not ALM is enabled in the viewer. With in-world content, this has led to some unsightly results, such as the Nautilus dry dock looking like it has been filled with cement (click image for full size, if required)

BUG-11562 was raised highlighting this latter impact to in-world content, with a request that the change be updated so that any surface using the “magic” invisiprim UUIDs is simply rendered as “invisible” (i.e. transparent, as is the case when running with ALM enabled). There has also been some debate among TPV developers about how to adopt the Lab’s code change, as well as the matter being discussed at both the Open-Source Developer meeting and the TPVD meeting held on March 25th, 2016 (audio extract below).

The latter discussions have resulted in both the Lab and TPV developers agreeing that the best solution would be to follow the BUG-11562 suggestion, and have surfaces and objects using the invisiprim UUIDs render as transparent whether or not ALM is enabled in the viewer.

A change to support this has already been submitted to the Lab. Subject to further testing, it, or a solution similar to it, is likely to be integrated into a future viewer update.
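Conceptually, the agreed behaviour amounts to special-casing the invisiprim texture UUIDs during rendering and treating them as fully transparent. The sketch below is purely illustrative: the actual change lives in the viewer’s C++ render pipeline, and the UUID values shown are placeholders rather than the real invisiprim UUIDs.

```python
# Illustrative sketch only: the real fix is in the viewer's C++ render
# pipeline. The UUIDs below are placeholders, not the actual invisiprim IDs.
INVISIPRIM_TEXTURE_UUIDS = {
    "00000000-0000-0000-0000-000000000001",  # placeholder
    "00000000-0000-0000-0000-000000000002",  # placeholder
}

def effective_alpha(texture_uuid: str, face_alpha: float) -> float:
    """Return the alpha to use for a face, with or without ALM enabled.

    Faces textured with an invisiprim UUID are rendered fully transparent
    instead of as solid grey or black surfaces.
    """
    if texture_uuid in INVISIPRIM_TEXTURE_UUIDS:
        return 0.0  # treat as invisible
    return face_alpha
```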

Lab updates on unscheduled deployments and other issues

secondlifeAs noted in my recent SL project update, there was an unscheduled deployment to the three Agni (main) grid release candidate (RC) channels of Bluesteel, LeTigre and Magnum on Thursday, February 18th, which saw regions on these channels undergo a rolling restart. This was followed on Friday, February 19th by rolling restarts across the Main (SLS) channel.

During the Server Beta User Group (SBUG) meeting on Thursday, February 18th, Steven Linden provided some information on why a deployment was made to the RC channels, indicated that a similar deployment would be forthcoming on the Main (SLS) channel, and promised that further information would be provided once that deployment had been made:

We had an unscheduled RC deploy earlier today. It’s for a security vulnerability that was released, and we discovered that Second Life regions were vulnerable. A full public post-mortem will be coming after we deploy to the rest of the main grid. I can’t say until it goes out to the rest of Agni; I can say that it was related to region availability only…. I honestly can’t say a great deal, other than we have a fix, and that it’s coming very soon to the rest of Agni.

True to this promise, following the Main channel roll on Friday, February 19th, April Linden blogged Why the Friday Grid Roll?

The reason essentially boiled down to a vulnerability affecting the Linux systems used to run the grid servers. The vulnerability lay within the GNU C library, commonly referred to as glibc, which if exploited could allow remote access to a device – be it a computer, Internet router, or other connected piece of equipment. It was initially discovered by Google on Tuesday, February 16th, and was labelled CVE-2015-7547.
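For reference, a quick way to see which C library and version a Linux host reports is shown below; the check is illustrative only, since whether a given glibc version actually contains the fix depends on the distribution’s backported patches.

```python
# Report the C library and version seen by Python on this host.
# Illustrative only: whether a given version is patched for CVE-2015-7547
# depends on the distribution's backports, so check vendor advisories.
import platform

lib, version = platform.libc_ver()
print(f"C library: {lib or 'unknown'}, version: {version or 'unknown'}")
```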

April’s blog post provides a concise explanation of just what went into the Lab’s security and operations teams’ efforts in ascertaining SL’s exposure to the vulnerability and developing an update to secure their servers against it.

All of this took time – but all things considered, it was still a remarkably fast effort. The Lab went from hearing about the risk on Tuesday 16th February through to understanding the full extent of the possible exposure SL faced, to having an update coded, tested and ready for release by Thursday, which as April explained, then left them with another decision:

Do we want to roll the code to the full grid at once? We decided that since the updates were to one of the most core libraries, we should be extra careful, and decided to roll the updates to the Release Candidate (RC) channels first. That happened on Thursday morning.

The Lab wanted to monitor how things progressed on the RC channels (which between them represent roughly 30% of the total grid), and ensure the update itself didn’t introduce anything unexpected. So it was that the deployment to the rest of the grid couldn’t be made until Friday, February 19th.

April emphasises that at no point during the known period of exposure or before, was there any attempt to use the vulnerability against the SL servers.  At the time of the Thursday roll, there was some criticism directed at the Lab for the lack of warning. April also explains why this was the case:

The reason there was little notice for the roll on Thursday is two-fold. First, we were moving very quickly, and second because the roll was to mitigate a security issue, we didn’t want to tip our hand and show what was going on until after the issue had been fully resolved.

While things like unscheduled rolls are disruptive, leaving us prone to grumbling and finger-pointing, it’s perhaps worthwhile taking this incident as an example that sometimes there are good reasons why the Lab doesn’t announce things first.

April’s post is actually one of three published recently by the operations / engineering teams which provide interesting insight into what goes on behind the scenes in keeping Second Life running.

In Recent Issues with the Nightly Biller, Steven Linden provides an explanation of why some Premium members recently experienced billing issues, up to and including inadvertently receiving delinquent balance notices. Once again, the explanation of what happened and what has been done to try to ensure a similar problem doesn’t occur in the future makes for a worthwhile read.

In Tale of the Missing ACK, Chris Linden describes another unusual and challenging incident the Lab’s engineering team had to deal with when testing a new API endpoint hosted in Amazon. This again illustrates the overall complexity of the Second Life services and infrastructure, which extends far beyond the simulator servers we so often take for granted as being “the” SL service, and the complexities involved in tracking issues down when things don’t go as expected / planned.

Thanks again to April, Steven and Chris for providing the explanations and the insight into SL’s services.