April reports on the SL Marketplace mix-up

On November 4th, some users accessing their account page on the Marketplace were shown account details belonging to another user who happened to be logged in to the Marketplace at the same time.

The user account page gives a user’s SL account name, L$ balance, a small portion of their Marketplace activity, their wish lists, received gifts list, and an obfuscated version of their e-mail address (e.g. i****@g****.com, designed to provide the user with enough information to identify their own e-mail address without revealing it to others).

Multiple bug reports on the issue were filed with Linden Lab, and at least one forum thread was started on the subject, with some pointing to the Marketplace maintenance then in progress as a possible cause – and they were right, as April Linden, the Lab’s Second Life Operations Manager, revealed in a blog post (Report on the Recent Marketplace Issue), which reads in part:

We’ve been working to make the Second Life Marketplace more robust and handle higher numbers of page views at once. Due to a change made this morning, the user account page got cached when we didn’t mean for it to be. Once we realised what had happened, we rolled back the changes immediately and deleted all of our caches. No other parts of Second Life were impacted.

Our engineering teams are now working with our QA (quality assurance) team to make sure we develop better testing for this in the future. We want to make sure we catch something like this long before it makes it out into the hands of Residents.

We’d like to extend a really big thank you to everyone who reported the issue to us the moment they saw it! Because of your vigilance we were able to react really quickly and limit the time that this misconfiguration was live. (Seriously, y’all rock! 💜)

We’re sorry this issue happened this morning. We’re working to make sure it never happens again, and developing better test procedures for use in the future.

While the error was unfortunate, and might have been a little discomforting for those who encountered it, the Lab estimates that no more than 500 users visited the account page during the window in which the issue could occur, and not all of them were served the wrong page.

Where the issue did occur, April notes that it did so at random, with the incorrect page to be displayed also selected at random, so it was impossible for a user to “pick” another user’s information and view it intentionally. She also notes that it was not possible either to make purchases via an incorrect account page, or to make any changes to it.

As always, details in full in April’s blog post – and many thanks to her again for providing an explanation of the issue and what is being done to hopefully avoid future repetitions.

April offers a look at the October 2019 woes

The period of Thursday, October 24th through Sunday, October 27th, 2019, saw Second Life encounter a rolling set of issues which finally came to a head on the Sunday. The issues affected many Second Life users and services, from logging in through to inventory / asset handling.

As has become the custom with these matters, April Linden, the Second Life Operations Manager, has provided a post-mortem blog post on the issue and her team’s work in addressing the problems. As always, her post provides insight into the complexities of keeping a platform such as Second Life running.

In short, the root cause of the weekend’s upsets lay not with any of the Second Life services, but with one of the Lab’s network providers – and matters were exacerbated by the fact that the first couple of times it happened, on Thursday and Friday, the problem appeared to correct itself before the Lab could fully identify the root cause.

April Linden

On Sunday the problems started up again, but fortunately April’s team were able to pin down the issue and commence work with their provider – which obviously meant that getting Second Life back on an even keel was largely in the hands of a third party, rather than fully under the Lab’s control.

Our stuff was (and still is) working just fine, but we were getting intermittent errors and delays on traffic that was routed through one of our providers. We quickly opened a ticket with the network provider and started engaging with them. That’s never a fun thing to do because these are times when we’re waiting on hold on the phone with a vendor while Second Life isn’t running as well as it usually does.

After several hours trying to troubleshoot with the vendor, we decided to swing a bigger hammer and adjust our Internet routing. It took a few attempts, but we finally got it, and we were able to route around the problematic network. We’re still trying to troubleshoot with the vendor, but Second Life is back to normal again.

– Extract from April Linden’s blog post

As a result of the problems, April’s team is working on moving some of the Lab’s services to make Second Life more resilient to similar incidents.

During the issues, some speculated that the problems were a result of the power outages being experienced in California at the time. As April notes, this was not the case – while Linden Lab’s head office is in San Francisco, the core servers and services are located in Arizona. However, efforts to resolve the issues from California were affected by the outages, again as April notes in her post.

It’s something I’ve noted before, and will likely state again: feedback like this from April, laying out what happened when SL encounters problems, is always an educational and invaluable read, not only explaining the issue itself, but also providing worthwhile insight into the complexities of Second Life.

Lab blogs on parent/child script communication issues


On Friday, October 4th, 2019, Linden Lab blogged about the recent script-related issues that caused widespread disruption (notably with rezzing systems) across Second Life following the SLS (Main) channel deployment made on Tuesday, September 24th, 2019, and which ultimately resulted in a complete rollback from the grid over September 27th/28th.

As noted in my Simulator User Group Updates, the release that caused the problems – simulator release 2019-09-06T22:03:53.530715 – included a number of updates intended to improve overall script performance, including how scripts are scheduled and events are delivered. However, these changes resulted in an unintended impact which, due to the sampling of regions involved, was not revealed when the update was initially deployed to a release candidate (RC) channel on Wednesday, September 11th.

The October 4th blog post from Linden Lab indicates that improvements have been made to the code which, once deployed, should help prevent a recurrence of the problem. As an aside, it had been hoped that these updates might be deployed to an RC channel on Wednesday, October 2nd, but a last-minute bug prevented this (see: Deploy Plan for the week of 2019-09-30), so the updates will likely be deployed during week #41 (commencing Monday, October 7th).

However, even with the fixes, the blog post goes on to note that there are some best practices to observe when using parent / child script communications between a parent object and a child object it rezzes:

One common cause of problems is communication between objects immediately after one creates the other. When an object rezzes another object in-world using llRezObject or llRezAtRoot, the two objects frequently want to communicate, such as through calls to llRegionSayTo or llGiveInventory. The parent object receives an object_rez() event when the new object has been created, but it is never safe to assume that scripts in the new object have had a chance to run when the object_rez event is delivered. This means that the new object may not have initialised its listen() event or called llAllowInventoryDrop, so any attempt to send it messages or inventory could fail. The parent object should not begin sending messages or giving inventory from the object_rez() event, or even rely on waiting some time after that event. Instead, the parent (rezzer) and the child (rezzee) should perform a handshake to confirm that both sides are ready for any transfer.

The blog post goes on to define the sequence of events between a parent and rezzed child object as they should occur, and provides sample code for such parent / child operations.
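By way of illustration only – and not a substitute for the sample code in the Lab’s blog post – a minimal handshake along the lines described might look something like the two short scripts below. The channel number and the child object’s name (“child object”, assumed to be in the rezzing object’s inventory) are purely arbitrary placeholders.

```lsl
// Parent (rezzer) script - illustrative sketch only.
// HANDSHAKE_CHANNEL and "child object" are arbitrary placeholders.
integer HANDSHAKE_CHANNEL = -7467201;

default
{
    touch_start(integer total_number)
    {
        // Rez the child, but do NOT try to talk to it yet.
        llRezObject("child object", llGetPos() + <0.0, 0.0, 1.0>,
                    ZERO_VECTOR, ZERO_ROTATION, 0);
    }

    object_rez(key id)
    {
        // The child now exists, but its scripts may not yet be running.
        // Just listen for it to announce that it is ready.
        llListen(HANDSHAKE_CHANNEL, "", id, "ready");
    }

    listen(integer channel, string name, key id, string message)
    {
        // The child has confirmed it is listening - it is now safe to
        // send it messages (or inventory via llGiveInventory).
        llRegionSayTo(id, HANDSHAKE_CHANNEL, "hello from the rezzer");
    }
}
```

```lsl
// Child (rezzee) script - illustrative sketch only.
// The channel must match the one used by the parent.
integer HANDSHAKE_CHANNEL = -7467201;

default
{
    on_rez(integer start_param)
    {
        // Set up the listener first, then announce readiness on the
        // shared channel; the parent filters on this object's key.
        llListen(HANDSHAKE_CHANNEL, "", NULL_KEY, "");
        llRegionSay(HANDSHAKE_CHANNEL, "ready");
    }

    listen(integer channel, string name, key id, string message)
    {
        // Messages from the parent arrive here once the handshake is complete.
        llOwnerSay("Parent says: " + message);
    }
}
```

The key point is that the parent does nothing in object_rez() beyond setting up a listen filtered on the child’s key; only once the child announces itself does the parent start sending messages or inventory.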

An important point to note is that, when the fix from the Lab is re-deployed, any scripts that still exhibit these kinds of communication issues will likely need to be altered by their creators to match the recommendations provided in the blog post.

Those wishing to know more are invited to read the original blog post in full, and to address any questions and / or feedback through the associated forum thread.

Lab blogs on experience scripts issue fix / workaround

Update, September 11th, 2019: The fixes for this issue have been deployed to regions on the LeTigre and Magnum RC channels in server deployment 2019-09-06T22:03:53.530715. Those wishing to test the fixes, and whose regions / experiences are not on either of these channels, can file a support ticket to have their region moved. Use Help > About in the viewer to check the simulator version number running on the region hosting your experience.

In my last few Simulator User Group (SUG) meeting updates, I’ve referenced issues being encountered by experience creators since a recent server-side deployment.

In short, over the last couple of weeks, scripts compiled to an experience have been failing to recompile, with the finger initially pointed at server deployment 19.08.06.529800 as being at fault.

However, the Lab has been engaged in fault-finding and attempts at rectifying the problem, and this work has revealed that the fault does not lie with any particular server release, as an official blog post issued on Thursday, September 5th, explains:

We have traced the problem to a loss of data in one of our internal systems. 

This data loss was due to human error rather than any change to server software. Why do we think this is good news? Because we can now easily prevent it from happening in the future. 

We have engaged in a first pass of recovery efforts which have yielded the restoration of the experience association for a number of scripts, and we are testing a server-based fix which will automatically correct most others. That fix is working its way through QA, and we will highlight this in the server release notes when it becomes available.

For those who have been impacted by the issue, the blog provides a set of steps to take to correct matters, should they not wish to wait for the back-end fix:

  1. Open the script in an object in-world or attached to you.
  2. Make sure the bottom widgets have your experience selected.
  3. Save.

These steps should be enough to get experience-enabled scripts running again.
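For context – and purely as an illustrative sketch rather than anything taken from the Lab’s post – an experience-enabled script of the kind affected is simply one that has been compiled against an experience and makes use of the experience functions, along the lines of the following (the chat messages are placeholders):

```lsl
// Illustrative sketch of a simple experience-enabled script.
// It must be saved with an experience selected (via the widgets at the
// bottom of the script editor) for the experience calls to work.

default
{
    touch_start(integer total_number)
    {
        // Ask the touching avatar for experience permissions.
        // The second parameter is currently unused.
        llRequestExperiencePermissions(llDetectedKey(0), "");
    }

    experience_permissions(key agent_id)
    {
        // Permission granted - the script may now use experience-only
        // capabilities (teleporting the agent, key-value storage, etc.).
        llOwnerSay("Experience permissions granted for " + (string)agent_id);
    }

    experience_permissions_denied(key agent_id, integer reason)
    {
        llOwnerSay("Experience permissions denied, reason code " + (string)reason);
    }
}
```

If a script like this has lost its experience association, re-saving it with the experience selected, as per the steps above, recompiles it against that experience.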

April Linden blogs on the May 13th/14th downtime

The week of May 13th-17th saw a planned period of Second Life network maintenance work, as announced in the Grid Status updates.

The first tranche of this work – Monday, May 13th through Tuesday, May 14th – appeared to go well, until there was a completely unexpected 4(ish) hours of downtime, which at the time caused significant upset.

On May 17th, April Linden, the Second Life Operations Manager, provided an insightful blog post on both the work being carried out and the cause of the downtime.

This week we were doing much needed maintenance on the network that powers Second Life. The core routers that connect our data centre to the Internet were nearing their end-of-life, and needed to be upgraded to make our cloud migration more robust.

Replacing the core routers on a production system that’s in very active use is really tricky to get right. We were determined to do it correctly, so we spent over a month planning all of the things we were going to do, and in what order, including full roll-back plans at each step. We even hired a very experienced network consultant to work with us to make sure we had a really good plan in place, all with the goal of interrupting Second Life as little as we could while improving it …

Everything started out great. We got the first new core router in place and taking traffic without any impact at all to the grid. When we started working on the second core router, however, it all went wrong.

– Extract from April Linden’s blog post

In essence, a cable had to be relocated, which was expected to cause a very brief period of impact. However, things didn’t recover as anticipated, and April resumes her explanation:

After the shock had worn off we quickly decided to roll back the step that failed, but it was too late. Everyone that was logged into Second Life at the time had been logged out all at once. Concurrency across the grid fell almost instantly to zero. We decided to disable logins grid-wide and restore network connectivity to Second Life as quickly as we could.

At this point we had a quick meeting with the various stakeholders, and agreed that since we were down already, the right thing to do was to press on and figure out what happened so that we could avoid it happening again…

This is why logins were disabled for several hours. We were determined to figure out what had happened and fix the issue, because we very much did not want it to happen again. We’ve engineered our network in a way that any piece can fail without any loss of connectivity, so we needed to dig into this failure to understand exactly what happened.

– Extract from April Linden’s blog post

April Linden

In other words, while it may have been painful for those who were unceremoniously dumped from Second Life and found they could not get back in, the Lab were working with the best of intentions: trying to find out exactly why connectivity was lost within a network engineered so that such an event should not cause so drastic a breakage. It’s also worth noting that, as per April’s blog post, even the engineers from the manufacturer of the Lab’s network equipment were perplexed by what happened.

As always, April’s blog post makes for an invaluable read in understanding some of the complexities of Second Life, and it goes so far as to answer a question raised on the forums in the wake of the week’s problems: why didn’t LL tell us exactly when this maintenance was going to happen? In short, there are bad actors in the world who could make use of publicly available announcements giving them precise information on when a network might be exposed.

If you’ve not read April’s blog posts on operational issues like this, I really cannot recommend them enough – and thanks are again offered to April for providing this post. And while things might have hurt at the time, there is a silver lining, as she notes:

Second Life is now up and running with new core routers that are much more powerful than anything we’ve had before, and we’ve had a chance to do a lot of failure testing. It’s been a rough week, but the grid is in better shape as a result.

SL Maintenance reminder: May 13-16th 2019

Just a reminder (or advance warning for those who may not have seen it): Second Life will be subject to up to four days of network maintenance, commencing on Monday, May 13th. This work may run through until Thursday, May 16th.

The details are available on the Second Life Grid Status pages, but are reproduced in full below:

Our engineers will be performing maintenance on the Second Life network May 13 – 16. We hope to perform most of the maintenance early in this window, but it may extend several days if needed.

Residents may experience problems connecting to, being disconnected from, or an inability to log in during this time, as well as possible issues rezzing objects, teleporting, or crossing regions. We hope to keep these disruptions to a minimum and greatly appreciate your patience during this time as we work to make Second Life more robust.

We will resolve this status once the maintenance has been fully completed.

So, if you do experience issues at the start of, or during, the week, be sure to keep an eye on the Grid Status pages for updates to this announcement.