Update, September 11th, 2019: The fixes for this issue have been deployed to regions on the LeTigre and Magnum RC channels in server deployment 2019-09-06T22:03:53.530715. Those wishing to test the fixes, and whose regions / experiences are not on either of these channels can file a support ticket to have their region moved. Use Help > About in the viewer to check the simulator version number running for your experience.
In my last few Simulator User Group (SUG) meeting updates, I’ve referenced issues being encountered by experience creators since a recent server-side deployment.
In short, over the last couple of weeks, any scripts compiled to an experience have failed to recompile, with the finger pointed at server deployment 19.08.06.529800 as being at fault.
However, the Lab has been engaged in fault-finding and attempts at rectifying the problem, and their work has revealed that the fault does not lie with any particular server release, as an official blog post issued on Thursday, September 5th explains:
We have traced the problem to a loss of data in one of our internal systems.
This data loss was due to human error rather than any change to server software. Why do we think this is good news? Because we can now easily prevent it from happening in the future.
We have engaged in a first pass of recovery efforts which have yielded the restoration of the experience association for a number of scripts, and we are testing a server-based fix which will automatically correct most others. That fix is working its way through QA, and we will highlight this in the server release notes when it becomes available.
For those who have been impacted by the issue, the blog provides a set of steps to take to correct matters, should they not wish to wait for the back-end fix:
Open the script in an object in-world or attached to you.
Make sure the bottom widgets have your experience selected.
These steps should be enough to get experience-enabled scripts running again.
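For those unfamiliar with experience-enabled scripts, they are ordinary LSL scripts compiled against an experience so that they can request persistent permissions from avatars. The following is purely an illustrative sketch (it is not part of the Lab’s fix, and the teleport offset is an arbitrary example), showing the experience-related events such a script typically uses:

```lsl
// Illustrative sketch of a minimal experience-enabled script.
// It only works when compiled against an experience via the
// drop-down at the bottom of the viewer's script editor.
default
{
    touch_start(integer total_number)
    {
        // Ask the toucher to accept the experience
        // (no prompt is shown if they have already accepted it)
        llRequestExperiencePermissions(llDetectedKey(0), "");
    }

    experience_permissions(key agent)
    {
        // Permission granted: experience-only functions such as
        // llTeleportAgent() may now act on this agent
        llTeleportAgent(agent, "", llGetPos() + <0.0, 0.0, 10.0>, ZERO_VECTOR);
    }

    experience_permissions_denied(key agent, integer reason)
    {
        llOwnerSay("Experience permissions denied, reason code: " + (string)reason);
    }
}
```

It is scripts of this kind – compiled against an experience – that lost their experience association as a result of the data loss described above.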
The week of May 13th-17th saw a planned period of Second Life network maintenance work, as announced in the Grid Status updates.
The first tranche of this work – Monday, May 13th through Tuesday May 14th – appeared to go well, until there was a completely unexpected 4(ish) hours of downtime, which at the time caused significant upset.
On May 17th, April Linden, the Second Life Operations Manager, provided an insightful blog post on both the work being carried out and the cause of the downtime.
This week we were doing much needed maintenance on the network that powers Second Life. The core routers that connect our data centre to the Internet were nearing their end-of-life, and needed to be upgraded to make our cloud migration more robust.
Replacing the core routers on a production system that’s in very active use is really tricky to get right. We were determined to do it correctly, so we spent over a month planning all of the things we were going to do, and in what order, including full roll-back plans at each step. We even hired a very experienced network consultant to work with us to make sure we had a really good plan in place, all with the goal of interrupting Second Life as little as we could while improving it …
Everything started out great. We got the first new core router in place and taking traffic without any impact at all to the grid. When we started working on the second core router, however, it all went wrong.
– Extract from April Linden’s blog post
In essence, a cable had to be relocated, which was expected to cause a very brief period of impact. However, things didn’t recover as anticipated, and April resumes her explanation:
After the shock had worn off we quickly decided to roll back the step that failed, but it was too late. Everyone that was logged into Second Life at the time had been logged out all at once. Concurrency across the grid fell almost instantly to zero. We decided to disable logins grid-wide and restore network connectivity to Second Life as quickly as we could.
At this point we had a quick meeting with the various stakeholders, and agreed that since we were down already, the right thing to do was to press on and figure out what happened so that we could avoid it happening again…
This is why logins were disabled for several hours. We were determined to figure out what had happened and fix the issue, because we very much did not want it to happen again. We’ve engineered our network in a way that any piece can fail without any loss of connectivity, so we needed to dig into this failure to understand exactly what happened.
– Extract from April Linden’s blog post
In other words, while it may have been painful for those who were unceremoniously dumped from Second Life and found they could not get back in, the Lab were working with the best of intentions: trying to find out exactly why connectivity was lost within a network where such an event should not cause such a drastic breakage – and it’s worth noting that, as per April’s blog post, even the engineers from the manufacturer of the Lab’s network equipment were perplexed by what happened.
As always, April’s blog post makes for an invaluable read in understanding some of the complexities of Second Life, and goes so far as to answer a question raised on the forums in the wake of the week’s problems: Why didn’t LL tell us exactly when this maintenance was going to happen? – in short there are bad actors in the world who could make use of publicly available announcements that give them precise information on when a network might be exposed.
If you’ve not read April’s blog posts on operational issues like this, I really cannot recommend them enough – and thanks are again offered to April for providing this post. And while things might have hurt at the time, there is a silver lining, as she notes:
Second Life is now up and running with new core routers that are much more powerful than anything we’ve had before, and we’ve had a chance to do a lot of failure testing. It’s been a rough week, but the grid is in better shape as a result.
Just a reminder (or advance warning for those who may not have seen it): Second Life will be subject to up to 4 days of network maintenance, commencing on Monday, May 13th. This work may possibly run through until Thursday, May 16th.
Our engineers will be performing maintenance on the Second Life network May 13 – 16. We hope to perform most of the maintenance early in this window, but it may extend several days if needed.
Residents may experience problems connecting to, being disconnected from, or an inability to log in during this time, as well as possible issues rezzing objects, teleporting, or crossing regions. We hope to keep these disruptions to a minimum and greatly appreciate your patience during this time as we work to make Second Life more robust.
We will resolve this status once the maintenance has been fully completed.
So, if you do experience issues at the start of, or during the week, be sure to keep an eye on the Grid Status pages for updates to this announcement.
As we’re all (probably painfully) aware, the last few months have seen Second Life plagued by region crossing issues, with users frequently disconnected (teleports – the most common form of region crossing – being particularly affected). One of the pains in dealing with these issues has been identifying the root cause, with most thinking centring on a timing issue in communications between the region receiving an incoming avatar and the user’s viewer.
However, speaking at the Content Creation User Group meeting on Thursday, April 18th, Vir Linden indicated that the problem might be related to the server Linux operating system update the Lab recently rolled out.
That update was initially deployed to a small cluster of regions on a release candidate channel called Cake, and those using Cake regions for testing in April have reported that it was these regions that first demonstrated the teleport issues – although at the time, they were thought to be local connection issues, rather than indicative of a deeper potential problem.
Commenting on the situation at the CCUG meeting, Vir said:
We’ve been having some issues on the simulator side where people tend to get disconnected during teleports … it’s been common enough that [it] shows up as a significant blip on our stats … and that issue seems to have come along … basically when we upgraded the version of Linux that we’re using on our simulators. So we’ve had to do some roll-backs there, just to try to get that issue to go away.
[But] that pushes out the time-line for [deploying] all the things that are based on … the later version [of Linux] that we’re trying to update to … Hopefully we can get those out soon, but I can’t tell you anything about the time-line.
This might explain the scheduled maintenance witnessed on April 18th, with large numbers of regions going off-line and being restarted. If this is the reason, whether the roll-backs do result in a reduction in the teleport issues remains to be seen. But if the data does indicate the region crossing issues have been reduced, then this can only be good news, and potentially worth the disruption of the maintenance and restarts.
In the meantime, the audio of Vir’s comments is provided below.
Various theories have popped up over the weeks as to why the problem is occurring – with fingers most often being pointed at the server-side deployment of the Environment Enhancement Project (EEP). Whether or not EEP is responsible is hard to judge.
As I noted in my March 30th TPVD meeting notes, one of the problems with the issues is that they appear to strike randomly, and cannot be reproduced with any consistency; were a single cause behind them, it’s not unreasonable to assume that investigations would lead to a point where some degree of reproduction could be manifested.
It has been suggested by some users that de-rendering the sky (CTRL-ALT-SHIFT-6) before a teleport attempt can ease the issue – although this is hardly a fix (and certainly no help to aviators and sailors), nor does it appear to work in all cases.
As trying to get to the root cause(s) of the problem is taking time, on Monday, April 8th, Linden Lab issued a blog post of their own on the matter, which reads in full:
Many Residents have noted that in the last few weeks we have had an increase in disconnects during a teleport. These occur when an avatar attempts to teleport to a new Region (or cross a Region boundary, which is handled similarly internally) and the teleport or Region crossing takes longer than usual. Instead of arriving at the expected destination, the viewer disconnects with a message like:
Darn. You have been logged out of Second Life.
You have been disconnected from the region you were in.
We do not currently believe that this is specific to any viewer, and it can affect any pair of Regions (it seems to be a timing-sensitive failure in the hand-off between one simulator and the next). There is no known workaround – please continue logging back in to get where you were going in the meantime.
We are very much aware of the problem, and have a crack team trying to track it down and correct it. They’re putting in long hours and exploring all the possibilities. Quite unfortunately, this problem dodged our usual monitors of the behaviour of simulators in the Release Channels, and as a result we’re also enhancing those monitors to prevent similar problems getting past us in the future.
We’re sorry about this – we empathise with how disruptive it has been.
As noted, updates are being provided as available, through the various related User Group meetings, and I’ll continue to endeavour to reflect these through my relevant User Group meeting updates.
In my week #44/1 User Group update, I noted that April Linden had indicated the issues Second Life users experienced with the platform on Sunday, October 28th through Monday, October 29th, 2018 were the result of a Distributed Denial of Service (DDoS) attack.
April has now issued a blog post expanding on her original forum comments, with the full text of her post reading:
Hello amazing Residents of Second Life!
A few days ago (on Sunday, October 28th, 2018) we had a really rough day on the grid. For a few hours it was nearly impossible to be connected to Second Life at all, and this repeated several times during the day.
The reason this happened is that Second Life was being DDoSed.
Attacks of this type are pretty common. We’re able to handle nearly all of them without any Resident-visible impact to the grid, but the attacks on Sunday were particularly severe. The folks who were on call this weekend did their best to keep the grid stable during this time, and I’m grateful they did.
Sunday is our busiest day in Second Life each week, and we know there’s lot of events folks plan during it. We’re sorry those plans got interrupted. Like most of y’all, I too have an active life as a Resident, and my group had to work around the downtime as well. It was super frustrating.
As always, the place to stay informed of what’s going on is the Second Life Grid Status Blog. We do our best to keep it updated during periods of trouble on the grid.
Thanks for listening. I’ll see you in-world!
April Linden Second Life Operations Team Lead
There’s not a lot more that can be added – DDoS attacks are an unfortunate fact of life, and while the Lab has learned to deal with most of them without impacting the normal flow of activities for Second Life users, it’s also unfortunate that at times this cannot be the case.
Thanks once again to April for the update on the situation.