Tuesday, August 22nd was not a particularly good day for Second Life, with an extended period of unscheduled maintenance during which log-ins were suspended and those in-world were advised to refrain from rezzing No Copy objects, making any LindeX-related transactions, and so on.
If these words sound familiar (except the date), it’s because I wrote them a year ago to the day, on August 23rd, 2016, when Second Life experienced some significant issues.
Back then, the problem was the core database. The initial problems on August 22nd, 2017 weren't software-related, nor were they related to the Main (SLS) channel deployment taking place at the time. Instead, they lay with a piece of hardware, as April Linden explained in another concise post on the Tools and Technology blog, which started:
Early this morning (during the grid roll, but it was just a coincidence) we had a piece of hardware die on our internal network. When this piece of hardware died, it made it very difficult for the servers on the grid to figure out how to convert a human-readable domain name, like www.secondlife.com, into IP addresses, like 216.82.8.56.
Everything was still up and running, but none of the computers could actually find each other on our network, so activity on the grid ground to a halt. The Second Life grid is a huge collection of computers, and if they can’t find each other, things like switching regions, teleports, accessing your inventory, changing outfits, and even chatting fail. This caused a lot of Residents to try to relog.
We quickly rushed to get the hardware that died replaced, but hardware takes time – and in this case, it was a couple of hours. It was very eerie watching our grid monitors. At one point the “Logins Per Minute” metric was reading “1,” and the “Percentage of Successful Teleports” was reading “2%.” I hope to never see numbers like this again.
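For anyone unfamiliar with name resolution, the short sketch below (a hypothetical Python example, not anything drawn from the Lab's own systems) illustrates the kind of lookup being described: the grid's servers were still running, but with the name service unreachable, nothing could translate a name like www.secondlife.com into an address.

```python
import socket

# Hypothetical illustration: resolving a human-readable name to an IP address,
# the kind of lookup that was failing while the hardware was down.
try:
    address = socket.gethostbyname("www.secondlife.com")
    print(f"www.secondlife.com resolves to {address}")
except socket.gaierror as err:
    # When name resolution is unavailable, the lookup fails even though the
    # target servers themselves may still be up and running.
    print(f"Name resolution failed: {err}")
```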
Unfortunately, as April went on to explain, the problems didn’t end there: once the hardware issue had been resolved, the log-in service got into something of a mismatch. Whilst telling viewers attempting to log in that their attempts had been unsuccessful, the service was telling the simulators the log-ins had been successful. Things didn’t start returning to normal until this issue had been corrected.
There is some good news coming out of this latter situation, however, as April goes on to note in the blog post:
We are currently in the middle of testing our next generation login servers, which have been specifically designed to better withstand this type of failure. We’ve had a few of the next generation login servers in the pool for the last few days just to see how they handle actual Resident traffic, and they held up really well! In fact, we think the only reason Residents were able to log in at all during this outage was because they happened to get really lucky and got randomly assigned to one of the next generation login servers that we’re testing.
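To picture why a successful log-in during the outage came down to luck, here is a purely hypothetical sketch of a mixed pool in which a few next-generation servers sit alongside the current ones and each log-in attempt is assigned at random; the names and counts are illustrative only and not figures from the Lab.

```python
import random

# Hypothetical mixed login pool: a handful of next-generation servers
# alongside the current ones. Counts are illustrative only.
login_pool = (["current-gen"] * 18) + (["next-gen"] * 2)

def assign_login_server():
    """Pick a login server at random for an incoming log-in attempt."""
    return random.choice(login_pool)

# With only a few next-gen servers in the pool, most attempts land on
# current-gen hosts, which is why getting one amounted to being "really lucky".
samples = [assign_login_server() for _ in range(10000)]
print("share of attempts hitting next-gen:", samples.count("next-gen") / len(samples))
```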
Testing of the new log-in servers has yet to be completed, but April notes that the hope is they will be ready for deployment soon.
Thanks once again to April for the update on the situation.