“A ballet in a war zone, beautiful, terrifying, and glorious” – inside LL’s Ops team

In May of 2014, Landon Linden, aka Landon McDowell, the Lab’s VP of Operations and Platform Engineering, wrote a blog post on the reasons why a series of issues combined to make Second Life especially uncomfortable for many.

At the time, and as many bloggers and commentators – myself included – noted, the post came as a breath of fresh air after so long without meat-and-veg communications from the Lab about what is going on with the platform and why things can go wrong.

Now Landon is back explaining how the Lab’s Ops team responds to issues within their services, the communications tools they use – and why the tools are so effective.

An Inside Look at How The Ops Team Collaborates is once again an interesting and informative piece. It delves not only into the technical aspects of how the Lab responds to problems within its services, but also into the very human aspects of dealing with issues: handling emotions when tensions are high, opening a window for those not directly involved in a matter to keep an eye on what is happening so that they can make better informed decisions on their own actions, and more.

Landon McDowell, the Lab’s VP of Operations and Platform Engineering and his alter-ego, Landon Linden

The core of the Lab’s approach to incident communications is the use of text chat (specifically IRC) rather than any reliance on crash team meetings, the telephone and so on. Those who deal with the Lab on a technical level won’t be surprised at the use of IRC – it is a fairly strong channel of communication for the Lab in a number of areas; but what makes this post particularly interesting is the manner in which IRC is presented and used: as a central incident and problem management tool for active issues; as a means of ensuring people can quickly get up-to-speed with both what has happened in a situation, and what has been determined / done in trying to deal with it; as a means of providing post-mortem information; and as a tool for helping train new hires.

These benefits start with what is seen as the sheer speed of communication chat allows, as Landon notes:

The speed of text communication is much faster. The average adult can read about twice as fast as they can listen. This effect is amplified with chat comms being multiplexed, meaning multiple speakers can talk intelligibly at the same time. With practice, a participant can even quickly understand multiple conversations interleaved in the same channel. The power of this cannot be overstated.

In a room or on a conference call, there can only be one speaker at a time. During an outage when tensions are high this kind of order can be difficult to maintain. People naturally want to blurt out what they are seeing. There are methods of dealing with this, such as leader-designating speakers or “conch shell” type protocols. In practice though, what often prevails is what one of my vendors calls the “Mountain View Protocol,” where the loudest speaker is the one who’s heard.

In text, responders are able to hop out of a conversation, focus on some investigation or action, hop back in, and quickly catch up due to the presence of scroll back. In verbal comms, responders check-out to do some work and lose track of the conversation resulting in a lot of repeating.

He also notes that not everyone is involved in a situation right from the start. Issues get escalated as they evolve, additional support may be called in, the net may be widened in the search for underlying causes, requiring additional teams to be involved, or the impact of an incident may spread. Chat, and the idea of “reading scrollback” as the Lab calls it, allows people to come on-stream for a given situation and become fully au fait with what has occurred and what is happening in a manner not always possible through voice communications and briefings, and without breaking the ongoing flow of communications and thinking on the issue.
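The “reading scrollback” idea can be illustrated with a minimal sketch. This is not how IRC or the Lab’s tooling is implemented; the `Channel` and `Message` names, and the use of a simple sequence number, are assumptions made purely for illustration. The point is only that an append-only log lets a late joiner, or someone who stepped away to investigate, catch up without anyone having to repeat themselves.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    seq: int      # position of the message in the channel log (1-based)
    sender: str
    text: str


@dataclass
class Channel:
    """A hypothetical append-only chat log, for illustration only."""
    messages: list[Message] = field(default_factory=list)

    def post(self, sender: str, text: str) -> int:
        """Append a message and return its sequence number."""
        seq = len(self.messages) + 1
        self.messages.append(Message(seq, sender, text))
        return seq

    def scrollback(self, after_seq: int = 0) -> list[Message]:
        """Return every message posted after the given sequence number."""
        return self.messages[after_seq:]


# A responder notes the last message they saw, goes off to investigate,
# then reads only what they missed when they return.
channel = Channel()
channel.post("alice", "login failures rising")
last_seen = channel.post("bob", "checking the database")
channel.post("alice", "database looks healthy")
channel.post("carol", "rolling back the last deploy")
missed = [m.text for m in channel.scrollback(after_seq=last_seen)]
```

Someone joining the incident from scratch would simply call `scrollback()` with no argument and read the whole history.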

The multiplexing capabilities of chat also mean that individuals can disengage from the main conversation, have private exchanges which, while pertinent to the issue, might otherwise derail the core conversation or even be silenced in something like a teleconference – and those engaged in such exchanges can still keep abreast of the central conversations.

For an environment like the Lab, where operations and personnel are distributed (data centres and offices located in different states / on different coasts, not everyone working from an office environment, etc.), chat has proven a powerful tool, although one that may take time getting to grips with, as Landon notes about his first exposure, saying:

I … just sat there staring at the screen wondering what the hell had just happened, wondering what the hell I had gotten myself into. I thought I was a seasoned pro, but I had never ever seen an incident response go that smoothly or quickly. Panic started to set in. I was out of my league.

However, the benefits in using it far outweigh any need for a degree of gear shifting required by ops staff in learning to use the approach. As Landon states in closing his comments, “when it works it is a wondrous thing to behold, a ballet in a war zone, beautiful, terrifying, and glorious.”

This is another great insight into what happens inside the Lab, and as such, the post makes very worthwhile reading, whether or not you have a background in Ops support.


Why things went wrong recently with Second Life, by Landon Linden

We’re all aware of the recent unpleasantness which hit Second Life over the past few weeks and which culminated in the chaos of Tuesday, May 20th. That disruption not only caused issues with log-ins, it also led to the curtailment of server-side deployments on the Tuesday, the rescheduling of deployments for the rest of the week, and the postponement of a period of planned maintenance.

As noted in my week 20/2 SL projects update, Simon and Maestro Linden gave an explanation of Tuesday’s issues at the Server Beta meeting on Thursday May 22nd. However, in a Tools and Technology blog post, Landon Linden has given a comprehensive explanation of the broader issues that have hit Second Life in recent weeks.

Landon begins the post:

When I came to Linden Lab over five years ago, Second Life had gone through a period of the coveted hockey-stick growth, and we had just not kept up with the technical demands such growth creates. One or more major outages a week were common.

In my first few months at the Lab, we removed more than a hundred major single points of failure in our service, but several major ones still loomed large, the granddaddy of them all being the core MySQL database server. By late Winter 2009 we were suffering from a core database outage a few times each week.

It is that core MySQL database server that has been partially to blame for the recent problems, having hit two different fatal hardware faults which forced the Lab to stop most SL services on both occasions. As the blog post explains, work is in-hand to remove some of the risk of this database becoming a single point of failure by moving it to new hardware. This will be followed over the coming weeks and months by efforts to further reduce the impact of database failures.

But the MySQL issue wasn’t the only cause of problems, as Landon further explains:

A few weeks ago there was a massive distributed denial of service attack on one of our upstream service providers that affected most of their customers, including us, and inhibited the ability of some to use our services. We have since mitigated future potential impact from such an attack by adding an additional provider. There have also been hardware failures in the Marketplace search infrastructure that have impacted that site, a problem that we are continuing to work through.

Landon Linden: explaining why SL has suffered severe issues of late

He also provides further information on the issue which impacted users and services on Tuesday May 20th, expanding on that given by Simon and Maestro at the Server Beta meeting.

At that meeting, Simon briefly outlined Tuesday’s issues as being a case of the log-in server failing to give the viewer the correct token for it to connect to a region, so people actually got through the log-in phase when starting their viewer, but never connected to a region.

Landon expands on this, describing how the mechanism for handing-off of sessions from login to users’ initial regions is a decade old and relies on the generation of a unique identifier (the “token” Simon referred to). Simply put: the mechanism ran out of numbers – but did so quietly and without flagging the fact that it had. As a result, the server team took four hours to track down the problem and come up with a fix.
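To make the failure mode concrete, here is a minimal sketch of an identifier allocator that fails loudly rather than quietly. It is not the Lab’s login code, which has not been published; the `SessionIdAllocator` class, the `IdExhausted` exception and the 90% warning threshold are all hypothetical, chosen only to show the difference between running out of numbers silently and flagging the fact.

```python
class IdExhausted(RuntimeError):
    """Raised when the allocator has no identifiers left to hand out."""


class SessionIdAllocator:
    """Hypothetical allocator handing out unique integers from a fixed range."""

    def __init__(self, max_id: int) -> None:
        self.max_id = max_id
        # Assumed policy: warn once roughly 90% of the range has been used.
        self.warn_at = max_id - max_id // 10
        self.next_id = 1
        self.warnings: list[str] = []

    def allocate(self) -> int:
        if self.next_id > self.max_id:
            # Fail loudly, so the fault appears in logs and alerts straight away
            # instead of surfacing later as an unexplained login failure.
            raise IdExhausted(f"all {self.max_id} identifiers have been used")
        allocated = self.next_id
        self.next_id += 1
        if allocated == self.warn_at:
            self.warnings.append(f"{allocated} of {self.max_id} identifiers used")
        return allocated


# Deliberately tiny range so the exhaustion case is easy to see.
allocator = SessionIdAllocator(max_id=10)
issued = [allocator.allocate() for _ in range(10)]
try:
    allocator.allocate()
    exhausted = False
except IdExhausted:
    exhausted = True
```

With an early warning and an explicit error of this kind, exhaustion would be visible the moment it happened, rather than taking hours to trace back from its symptoms.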

Referring to this particular issue, Landon goes on:

Having such a hidden fault in a core service  is unacceptable, so we are doing a thorough review of the login process to determine if there are any more problems like this lurking. Our intent at this point also is to remove the identifier assignment service altogether. It not only was the ultimate source of this outage, but is also one more single point of failure that should have been dispatched long ago.

Such open honesty and transparency about technical matters is something that hasn’t really been seen from the Lab since the departure of Frank (FJ Linden) Ambrose, the Lab’s former Senior VP of Global Technology, who departed the company at the end of 2011. As such, it is an excellent demonstration of Ebbe Altberg’s promise to re-open the lines of communication between company and users, and one which is most welcome.

Kudos to Landon for his sincere apology for the disruption in services and for such a comprehensive explanation of the problems. Having such information will hopefully aid our understanding of the challenges the Lab faces in dealing with a complex set of services which is over a decade old, but which we expect to be ready and waiting for us 24/7. Kudos as well to Ebbe Altberg for re-opening the hailing frequencies. Long may it continue.


Ebbe: the promise of better communications and a more open JIRA

Since his first official blog post introducing himself, Ebbe Altberg has not only been immersing himself in the activities required of a new CEO on joining a company, he’s been making the time to respond to a series of SL forum posts made in a thread started as a result of his blog post.

In doing so, he’s demonstrated the same candid feedback which has marked many of his Twitter exchanges with Second Life users, and which he also showed during his recent meet-and-greet with a number of us.

LL’s new CEO, Ebbe Altberg, seen here on the right in his guise as Ebbe Linden at a recent meet-and-greet: laying the foundations for improved communications from the Lab?

On Communications

One of the major topics of early exchanges with him via Twitter and through various blogs has been on the subject of broader outward communications from the Lab.

Commenting on the forum thread, Amethyst Jetaime raises communications, saying in part:

However I hope you at least take our opinions to heart, take our suggestions when you can and honestly communicate frequently through the official SL channels. Not all of us use twitter and facebook or third-party forums …

His reply to her is encouraging:

Everybody I’ve spoken with here at LL want to improve communication with our customers as well…funny that…

He expands on this in a subsequent reply to a similar comment from Venus Petrov, in which he says:

And they can’t wait to do that…most common question/issue on both sides of the “fence” has been the same thing! I’m getting love from both sides when I’m talking about fixing communication. I don’t know when/how it got strange but we’ll work hard to make us better at it…motivation is not an issue at all. We just need to figure out process for doing it effectively at scale…

How this will be achieved is open to debate; but the Lab has the means at their disposal to make broad-based communications far more effective, and I tried to point to some of them in my own “Dear Ebbe…” blog post on the matter. In that piece, I particularly look at both the official SL blog and the opportunities presented by e-mail, both of which would appear to meet the criteria of scalability, with an e-mail approach additionally having the potential to reach out to those no longer directly engaged in SL on a regular basis or at all and perhaps encourage them to take another look.

On the Public JIRA

Elsewhere in the thread, Pamela Galli takes the issue of communications to point to the closure of the public JIRA in September 2012:

… In the opinions of many, a good place to start is to make the JIRAs public again so we will know whether an issue is a bug that has arisen, or something on our end. Very often, residents working with Lindens have identified, reproduced, and even come up with workarounds if not solutions to problems. Closing the JIRA felt like a door being slammed, esp to those of us who are heavily invested in SL. (Just grateful for Maestro, who posts in the Server Forum.)

Again, there is an encouraging response:

Funny, both engineering and product heads here also didn’t like that jira was closed and want to open it up again. Proposal for how is in the works! I hope we can figure out how to do that in a way that works/scales soon.

Later in the thread, Innula Zenovka provides one of the most lucid, clearly stated reasons why a complete closure of the public JIRA was perhaps more counter-productive from a technical standpoint than the Lab may have appreciated at the time. Ebbe’s response is again equally reassuring:

Yep, that’s why we will figure out how to open things up again…plan is in the works…

Whether we’ll see a complete re-opening of the public JIRA remains to be seen. I rather suspect the Lab will be looking at something more middle-ground, such as making the JIRA public, but restricting comments to those currently able to access it, together with those actually raising a report also gaining the ability to comment on it as a means of providing additional input / feedback.

While not absolutely perfect, it would mean that the Lab avoids any situation where comments within a JIRA become a free-for-all for complaints, accusations, and arguments (either directed at the Lab or between comment participants), while offering the majority of the advantages which used to be apparent with a more open JIRA mechanism.

Of course, optimism around this feedback – and particularly around the proposal for the JIRA – should be tempered with caution. Not only may it take time for changes to be implemented, it may also be that technical or other issues prevent a more open approach to the JIRA from being achieved to the extent that even the Lab would like. However, the willingness to say openly that matters are already under consideration at the Lab hopefully suggests a reasonable level of confidence that things can be done, without risking a repeat of the disappointment which followed the decision, back in March 2012, that there would be no return of last names.

Whatever does happen, there’s enough in these replies to give rise to a cautious and reasonable optimism that things are likely to be changing for the better down the road. Most certainly, it is good to see an outward flow of communication from the Lab’s CEO that is open and candid.

Long may it continue once Ebbe has had to turn his attention more fully to running the company, and others have stepped in to fill the void and to ensure the follow-through is both achieved and consistent.

 

Silence may be golden, but it also weighs heavy

I keep promising myself I won’t start banging on about Linden Lab’s inability to openly communicate. That was more-or-less the tone of things in this blog back in 2011 (see my views on business, communication and growth, and the growing frustration over the Marketplace situation in 2012, as well as points in between and after, if interested). However…

Rod Humble may have gone, but the Lab has yet to issue any statement in response to enquiries from the media

Friday 24th January saw the news break that Rod Humble had departed the Lab. According to his own comments, passed to others at the time of the announcement, he’d left the Lab “last week”. If so, this could mean the Lab has been absent a CEO for about two weeks, and they have yet to say anything on the matter.

It’s not just the fact that repeated enquiries from the likes of Hamlet Au and me (among others) have gone without response – we’re still small fish in the ocean of blogging / journalism. Where the story has been picked up by the games media, it appears that enquiries made to the Lab also remain unanswered.

True, the message has been somewhat slow in spreading to the media at large; only Gamesbeat and Games Industry picked up on the news on the 24th. Since then, Gamasutra covered the news on January 28th, as did Massively. Nevertheless, one would have thought some message would have been forthcoming from the Lab in order to squash the potential for speculation or negative rumours to become established as fact. Or could it be that Rod Humble’s announcement was a knickers-around-ankles moment for the Lab?

See what I mean about speculation?

Beyond this, as Ciaran Laval observes, there is still ongoing confusion and upset relating to attempts to cash out and / or tax ID requirements. A part of this seems to be down to the Lab possibly being overwhelmed by the inflow of documentation, and it is taking time to clear things up. However, the fact that nothing is – once again – being done to communicate matters and provide some form of open feedback really isn’t helping at all.

Of course, the Lab may well feel secure in its position that the majority of SL users are likely to be oblivious as to what is going on, and are happy knowing that SL is still there for them when they are ready to log in. But for those who are investing time, effort and money into helping make Second Life a place people want to log in to and enjoy, the Lab’s failure to take the time and effort to offer reasonable clarification of what is going on as regards things like cash-outs and tax (and, indeed, what is and isn’t required ahead of time) doesn’t tend to send a positive message, but does tend to add a little more weight to an overburdened camel’s back.

In writing about Rod Humble’s tenure, I pointed out that communications had started on a downward trend prior to his arrival, and had continued to sink throughout his time there, despite his own initial attempts to ramp things up. This smacks of a deep-seated cultural element within the company (driven out of the board?) which doesn’t see communications as having any real priority. As such, I’m not holding my breath in the hope that things will change, even with a new CEO, when (if?) we ever get to hear about one being appointed.

But even a short-term upswing, as witnessed in the months immediately following Humble’s arrival at the Lab prior to the downward trend resuming, would actually be better than what we have at the moment. I won’t borrow from Tateru again and use her Silence of the Lab logo, but I can admit, I’m sorely tempted to do so.