Second Life: parent/child script communication issues

On Friday, October 4th, 2019, Linden Lab blogged about the recent script-related issues that caused widespread disruption (notably with rezzing systems) across Second Life following the SLS (Main) channel deployment made on Tuesday, September 24th, 2019, and which ultimately resulted in that release being completely rolled back from the grid on September 27th/28th.

As noted in my Simulator User Group Updates, the release that caused the problems (simulator release 2019-09-06T22:03:53.530715) included a number of updates intended to improve overall script performance, including how scripts are scheduled and events are delivered. However, these changes had an unintended impact which, due to the limited sampling of regions on the channel, was not revealed when the update was initially deployed to a release candidate (RC) channel on Wednesday, September 11th.

The October 4th blog post from Linden Lab indicates that improvements have been made to the code, and once deployed, these should help prevent a recurrence of the problem. As an aside, it had been hoped that these updates might be deployed to an RC channel on Wednesday, October 2nd, but a last-minute bug prevented this (see: Deploy Plan for the week of 2019-09-30), so the updates will likely be deployed during week #41 (commencing Monday, October 7th).

However, even with the fixes, the blog post goes on to note some best practices for script communications between a parent object and a child object it rezzes:

One common cause of problems is communication between objects immediately after one creates the other. When an object rezzes another object in-world using llRezObject or llRezAtRoot, the two objects frequently want to communicate, such as through calls to llRegionSayTo or llGiveInventory. The parent object receives an object_rez() event when the new object has been created, but it is never safe to assume that scripts in the new object have had a chance to run when the object_rez event is delivered. This means that the new object may not have initialised its listen() event or called llAllowInventoryDrop, so any attempt to send it messages or inventory could fail. The parent object should not begin sending messages or giving inventory from the object_rez() event, or even rely on waiting some time after that event. Instead, the parent (rezzer) and the child (rezzee) should perform a handshake to confirm that both sides are ready for any transfer.

The blog post goes on to define the sequence of events that should occur between a parent and a rezzed child object, and provides sample code for such parent / child operations.
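By way of illustration, a minimal sketch of the kind of handshake the Lab describes is given below. This is not the sample code from the blog post itself: the channel number, the "child object" inventory name and the "READY" / "CONFIG:example" message strings are arbitrary assumptions, and the sketch assumes both scripts sit in the root prim of their respective objects.

    // Parent (rezzer) sketch: rez the child, then wait for it to announce
    // readiness before sending anything to it.
    integer COMM_CHANNEL = -7654321;    // arbitrary negative channel (assumption)

    default
    {
        touch_start(integer total_number)
        {
            // "child object" is an assumed inventory item held by the rezzer
            llRezObject("child object", llGetPos() + <0.0, 0.0, 1.0>,
                        ZERO_VECTOR, ZERO_ROTATION, 0);
        }

        object_rez(key id)
        {
            // Do NOT message the child here; its scripts may not be running yet.
            // Instead, listen for its "READY" handshake, filtered to its key.
            llListen(COMM_CHANNEL, "", id, "READY");
        }

        listen(integer channel, string name, key id, string message)
        {
            // The child has confirmed it is listening, so it is now safe
            // to send it messages (or inventory).
            llRegionSayTo(id, COMM_CHANNEL, "CONFIG:example");
        }
    }

And the corresponding child (rezzee) script:

    // Child (rezzee) sketch: open the listener first, then tell the parent
    // this object is ready to receive.
    integer COMM_CHANNEL = -7654321;    // must match the parent's channel

    default
    {
        on_rez(integer start_param)
        {
            // Open the listener before announcing readiness, so nothing
            // the parent sends in response can be missed.
            llListen(COMM_CHANNEL, "", NULL_KEY, "");
            llRegionSay(COMM_CHANNEL, "READY");
        }

        listen(integer channel, string name, key id, string message)
        {
            // Handle whatever the parent sends once the handshake is complete.
            llOwnerSay("Received from parent: " + message);
        }
    }

The point of the exchange is simply that the child only announces itself after its listener exists, and the parent only transmits after hearing that announcement, removing the race condition the Lab describes.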

An important point to note with this is that when the Lab's updated code is re-deployed, any scripts that still exhibit these kinds of communication issues will likely need to be altered by their creators to match the recommendations provided in the blog post.

Those wishing to know more are invited to read the original blog post in full, and to address any questions and / or feedback through the associated forum thread.

Lab blogs on experience scripts issue fix / workaround

Update, September 11th, 2019: The fixes for this issue have been deployed to regions on the LeTigre and Magnum RC channels in server deployment 2019-09-06T22:03:53.530715. Those wishing to test the fixes, and whose regions / experiences are not on either of these channels, can file a support ticket to have their region moved. Use Help > About in the viewer to check the simulator version number running for your experience.

In my last few Simulator User Group (SUG) meeting updates, I’ve referenced issues being encountered by experience creators since a recent server-side deployment.

In short, over the last couple of weeks, any scripts compiled to an experience have failed to recompile. The finger had been pointed at server deployment 19.08.06.529800 as being at fault.

However, the Lab has been engaged in fault-finding and attempts at rectifying the problem, and their work has revealed that the fault does not lie with any particular server release, as an official blog post issued on Thursday, September 5th explains:

We have traced the problem to a loss of data in one of our internal systems. 

This data loss was due to human error rather than any change to server software. Why do we think this is good news? Because we can now easily prevent it from happening in the future. 

We have engaged in a first pass of recovery efforts which have yielded the restoration of the experience association for a number of scripts, and we are testing a server-based fix which will automatically correct most others. That fix is working its way through QA, and we will highlight this in the server release notes when it becomes available.

For those who have been impacted by the issue, the blog provides a set of steps to take to correct matters, should they not wish to wait for the back-end fix:

  1. Open the script in an object in-world or attached to you.
  2. Make sure the bottom widgets have your experience selected.
  3. Save.

These steps should be enough to get experience-enabled scripts running again.

April Linden blogs on the May 13th/14th downtime

The week of May 13th-17th saw a planned period of Second Life network maintenance work, as announced in the Grid Status updates.

The first tranche of this work – Monday, May 13th through Tuesday, May 14th – appeared to go well, until there was a completely unexpected 4(ish) hours of downtime, which at the time caused significant upset.

On May 17th, April Linden, the Second Life Operations Manager, provided an insightful blog post on both the work being carried out and the cause of the downtime.

This week we were doing much needed maintenance on the network that powers Second Life. The core routers that connect our data centre to the Internet were nearing their end-of-life, and needed to be upgraded to make our cloud migration more robust.

Replacing the core routers on a production system that’s in very active use is really tricky to get right. We were determined to do it correctly, so we spent over a month planning all of the things we were going to do, and in what order, including full roll-back plans at each step. We even hired a very experienced network consultant to work with us to make sure we had a really good plan in place, all with the goal of interrupting Second Life as little as we could while improving it …

Everything started out great. We got the first new core router in place and taking traffic without any impact at all to the grid. When we started working on the second core router, however, it all went wrong.

– Extract from April Linden’s blog post

In essence, a cable had to be relocated, which was expected to cause a very brief period of impact. However, things didn’t recover as anticipated, and April resumes her explanation:

After the shock had worn off we quickly decided to roll back the step that failed, but it was too late. Everyone that was logged into Second Life at the time had been logged out all at once. Concurrency across the grid fell almost instantly to zero. We decided to disable logins grid-wide and restore network connectivity to Second Life as quickly as we could.
At this point we had a quick meeting with the various stakeholders, and agreed that since we were down already, the right thing to do was to press on and figure out what happened so that we could avoid it happening again…
This is why logins were disabled for several hours. We were determined to figure out what had happened and fix the issue, because we very much did not want it to happen again. We’ve engineered our network in a way that any piece can fail without any loss of connectivity, so we needed to dig into this failure to understand exactly what happened.

– Extract from April Linden’s blog post


In other words, while it may have been painful for those who were unceremoniously dumped from Second Life and found they could not get back in, the Lab were working with the best of intentions: trying to find out exactly why connectivity was lost within a network where such an event should not have caused such a drastic breakage. It is also worth noting that, as per April’s blog post, even the engineers from the manufacturer of the Lab’s network equipment were perplexed by what happened.

As always, April’s blog post makes for an invaluable read in understanding some of the complexities of Second Life, and goes so far as to answer a question raised on the forums in the wake of the week’s problems: why didn’t LL tell us exactly when this maintenance was going to happen? In short, there are bad actors in the world who could make use of publicly available announcements that give them precise information on when a network might be exposed.

If you’ve not read April’s blog posts on operational issues like this, I really cannot recommend them enough, and thanks are again offered to April for providing this post. And while things might have hurt at the time, there is a silver lining, as she notes:

Second Life is now up and running with new core routers that are much more powerful than anything we’ve had before, and we’ve had a chance to do a lot of failure testing. It’s been a rough week, but the grid is in better shape as a result.

SL Maintenance reminder: May 13-16th 2019

Just a reminder (or advance warning for those who may not have seen it): Second Life will be subject to up to 4 days of network maintenance, commencing on Monday, May 13th. This work may possibly run through until Thursday, May 16th.

The details are available on the Second Life Grid Status pages, but are reproduced in full below:

Our engineers will be performing maintenance on the Second Life network May 13 – 16. We hope to perform most of the maintenance early in this window, but it may extend several days if needed.

Residents may experience problems connecting to, being disconnected from, or an inability to log in during this time, as well as possible issues rezzing objects, teleporting, or crossing regions. We hope to keep these disruptions to a minimum and greatly appreciate your patience during this time as we work to make Second Life more robust.

We will resolve this status once the maintenance has been fully completed.

So, if you do experience issues at the start of, or during the week, be sure to keep an eye on the Grid Status pages for updates to this announcement.

Linux OS update to servers a cause of SL TP issues?

As we’re all (probably painfully) aware, the last few months have seen Second Life plagued by region crossing issues, with users frequently disconnected (teleports, the most common form of region crossing, being particularly affected). One of the pains in dealing with these issues has been identifying the root cause, with most thinking being that it is a timing issue in the communications between the region receiving an incoming avatar and the user’s viewer.

However, speaking at the Content Creation User Group meeting on Thursday, April 18th, Vir Linden indicated that the problem might be related to the server Linux operating system update the Lab recently rolled out.

That update was initially deployed to a small cluster of regions on a release candidate channel called Cake, and those using Cake regions for testing in April have reported that it was these regions which first demonstrated the teleport issues, although at the time they were thought to be local connection problems rather than indicative of a deeper issue.

Commenting on the situation at the CCUG meeting, Vir said:

We’ve been having some issues on the simulator side where people tend to get disconnected during teleports … it’s been common enough that shows up as a significant blip on our stats … and that issue seems to have come along … basically when we upgraded the version of Linux that we’re using on our simulators, so we’ve had to do some roll-backs there, just to try to get that issue to go away.

[But] that pushes out the time-line for [deploying] all the things that are based on … the later version [of Linux] that we’re trying to update to … Hopefully we can get those out soon, but I can’t tell you anything about the time-line.

This might explain the scheduled maintenance witnessed on April 18th, which saw a large number of regions going off-line and being restarted. If this is the reason, whether the roll-backs result in a reduction in the teleport issues for those regions remains to be seen. But if data does indicate the region crossing issues have been reduced, then this can only be good news, and potentially worth the disruption of the maintenance and restarts.

In the meantime, the audio of Vir’s comments is provided below.


Second Life: teleport / region crossing disconnects

As I’ve been reporting in my weekly Simulator User Group meeting summaries and my Third-Party Viewer Developer meeting updates, there have been widespread issues with disconnects during region crossings – both via teleport and physical region crossings (e.g. via boat or aircraft or on foot).

Various theories have popped up over the weeks as to why the problem is occurring, with fingers most often being pointed at the server-side deployment of the Environment Enhancement Project (EEP). Whether or not EEP is responsible is hard to judge.

As I noted in my March 30th TPVD meeting notes, one of the problems with the issues is that they appear to strike randomly, and cannot be reproduced with any consistency; were a single cause behind them, it’s not unreasonable to assume that investigations would lead to a point where some degree of reproduction could be achieved.

It has been suggested by some users that de-rendering the sky (CTRL-ALT-SHIFT-6) before a teleport attempt can apparently ease the issue – although this is hardly a fix (and certainly no help to aviators and sailors), nor does it appear to work in all cases.

As trying to get to the root cause(s) of the problem is taking time, on Monday, April 8th, Linden Lab issued a blog post of their own on the matter, which reads in full:

Many Residents have noted that in the last few weeks we have had an increase in disconnects during a teleport. These occur when an avatar attempts to teleport to a new Region (or cross a Region boundary, which is handled similarly internally) and the teleport or Region crossing takes longer than usual.  Instead of arriving at the expected destination, the viewer disconnects with a message like:

Darn. You have been logged out of Second Life.

You have been disconnected from the region you were in.

We do not currently believe that this is specific to any viewer, and it can affect any pair of Regions (it seems to be a timing-sensitive failure in the hand-off between one simulator and the next).  There is no known workaround – please continue logging back in to get where you were going in the meantime.

We are very much aware of the problem, and have a crack team trying to track it down and correct it. They’re putting in long hours and exploring all the possibilities. Quite unfortunately, this problem dodged our usual monitors of the behaviour of simulators in the Release Channels, and as a result we’re also enhancing those monitors to prevent similar problems getting past us in the future.

We’re sorry about this – we empathise with how disruptive it has been.

As noted, updates are being provided as available, through the various related User Group meetings, and I’ll continue to endeavour to reflect these through my relevant User Group meeting updates.