In May 2014, Landon Linden, aka Landon McDowell, the Lab’s VP of Operations and Platform Engineering, wrote a blog post explaining how a series of issues had combined to make Second Life especially uncomfortable for many.
At the time, as many bloggers and commentators (myself included) noted, the post came as a breath of fresh air after so long without meat-and-veg communications from the Lab about what is going on with the platform and why things can go wrong.
Now Landon is back explaining how the Lab’s Ops team responds to issues within their services, the communications tools they use – and why the tools are so effective.
The speed of text communication is much faster. The average adult can read about twice as fast as they can listen. This effect is amplified with chat comms being multiplexed, meaning multiple speakers can talk intelligibly at the same time. With practice, a participant can even quickly understand multiple conversations interleaved in the same channel. The power of this cannot be overstated.
In a room or on a conference call, there can only be one speaker at a time. During an outage when tensions are high this kind of order can be difficult to maintain. People naturally want to blurt out what they are seeing. There are methods of dealing with this, such as leader-designating speakers or “conch shell” type protocols. In practice though, what often prevails is what one of my vendors calls the “Mountain View Protocol,” where the loudest speaker is the one who’s heard.
In text, responders are able to hop out of a conversation, focus on some investigation or action, hop back in, and quickly catch up due to the presence of scroll back. In verbal comms, responders check-out to do some work and lose track of the conversation resulting in a lot of repeating.
He also notes that not everyone is involved in a situation right from the start. Issues get escalated as they evolve: additional support may be called in, the net may be widened in the search for underlying causes, requiring additional teams to be involved, or the impact of an incident may spread. Chat, and the idea of “reading scrollback” as the Lab calls it, allows people to come on-stream for a given situation and become fully au fait with what has occurred and what is happening, in a manner not always possible through voice communications and briefings, and without breaking the ongoing flow of communications and thinking on the issue.
The multiplexing capabilities of chat also mean that individuals can disengage from the main conversation to have private exchanges which, while pertinent to the issue, might otherwise derail the core conversation or even be silenced in something like a teleconference. Those engaged in such exchanges can still keep abreast of the central conversation.
For an environment like the Lab, where operations and personnel are distributed (data centres and offices located in different states / on different coasts, not everyone working from an office environment, etc.), chat has proven a powerful tool, although one that may take time getting to grips with, as Landon notes about his first exposure, saying:
However, the benefits of using it far outweigh the degree of gear shifting required of ops staff in learning the approach. As Landon states in closing his comments, “when it works it is a wondrous thing to behold, a ballet in a war zone, beautiful, terrifying, and glorious.”
This is another great insight into what happens inside the Lab, and as such, the post makes very worthwhile reading, whether or not you have a background in Ops support.