Update: At the time this article went to press, it appeared the daily restarts were still in progress (hence the reference to the restarts running November 17th-21st). After this article appeared, the Lab updated the Grid Status report to indicate the work has actually been completed; the Lab’s blog post therefore did in fact mark the end of the work.
The week of November 17th – 21st 2014 has been marked by daily periods of region restarts. Notice of the restarts was first posted via a Grid Status update on Friday, November 14th.
As I noted in the first of my SL project updates for the week, Simon Linden indicated that the restarts and the attendant maintenance were hardware-related, requiring servers to be taken down and physically opened up, although precise details on what was being done were still scant.
In a blog post published on Thursday, November 20th, the Lab provided a detailed explanation of the reasons for the restarts, which reads in full:
Keeping the systems running the Second Life infrastructure operating smoothly is no mean feat. Our monitoring infrastructure keeps an eye on our machines every second, and a team of people work around the clock to ensure that Second Life runs smoothly. We do our best to replace failing systems proactively and invisibly to Residents. Unfortunately, sometimes unexpected problems arise.
In late July, a hardware failure took down four of our latest-generation of simulator hosts. Initially, this was attributed to be a random failure, and the machine was sent off to our vendor for repair. In early October, a second failure took down another four machines. Two weeks later, another failure on another four hosts.
Each host lives inside a chassis along with three other hosts. These four hosts all share a common backplane that provides the hosts with power, networking and storage. The failures were traced to an overheating and subsequent failure of a component on these backplanes.
After exhaustive investigation with our vendor, the root cause of the failures turned out to be a hardware defect in a backplane component. We arranged an on-site visit by our vendor to locate, identify, and replace the affected backplanes. Members of our operations team have been working this week with our vendor in our data centre to inspect every potentially affected system and replace the defective component to prevent any more failures.
The region restarts that some of you have experienced this week were an unfortunate side-effect of this critical maintenance work. We have done our best to keep these restarts to a minimum as we understand just how disruptive a region restart can be. The affected machines have been repaired, and returned to service and we are confident that no more failures of this type will occur in the future. Thank you all for your patience and understanding as we have proceeded through the extended maintenance window this week.
Once again, it’s good to see that Landon Linden and his team are keeping the channels of communication open, and working to keep users apprised of what’s happening whenever and wherever they can.