Of channels and restarts

Once upon a time server roll-outs for Second Life were handled in what, on the surface, would seem a fairly straightforward manner:

  • New code would be tested on the Beta Grid, with users reporting any bugs or issues to LL for fixing
  • When considered relatively stable, the code would be rolled out on a limited basis to the Main Grid (affecting around 20% of the grid in total) for further “testing”; if major problems were found, the limited roll-out (or “pilot”) would be rolled back
  • If considered stable, the code would be rolled out to the remaining 80% of the grid, generally around 24 hours after the pilot.

The system wasn’t flawless; the complexity of the server code meant that many small (and one would guess conflicting) updates would be “rolled-up” into a single release, often with unpredictable results, despite testing on the Beta Grid. This would result in what I call the “tidal effect”: a change would be rolled out as a pilot, rolled back for fixing, rolled out again, rolled back for further fixing, rolled out once more, and then finally rolled out to the entire Main Grid. Sometimes even then, it would go through one more rollback / roll-out.

As we’re all only too aware, this approach meant fairly large and constant upheavals for just about everyone concerned, and was the cause of much gnashing of teeth and dark mutterings directed towards Linden Lab.

To try and minimise the overall impact of server code updates and roll-outs, Linden Lab switched over to a “channel” system. Under this system, server code is operated across four channels: the Release Channel, with the latest “release version” of the server code (and supposedly the most stable), and three “Release Candidate” channels, code-named Blue Steel, Magnum and Le Tigre.

Each of the RC channels comprises about 10% of the total Main Grid, and is used to roll out a “beta” of a specific server code package. This might be a series of bug fixes (e.g. specific SVC JIRA fixes), it might be a general maintenance release (e.g. security updates, etc.), or it may be related to a specific, on-going project (such as Display Names, the “Fast Assets” project, the “Inventory Capabilities” project, and so on). Broadly speaking, specific projects tend to be rolled out through specific channels (the Inventory Capabilities project tends to roll out via Blue Steel, for example, as do changes related to the forthcoming arrival of Mesh) – although this is not a hard and fast rule. General maintenance releases, on the other hand, are distributed across all three channels, depending on which has the capacity at the time a release package is ready for beta testing.

So, at any one point in time, some 30% of the grid is hosting what is effectively “beta” software, but in very discrete “chunks”, so to speak, confined to known sets of simulators. The releases themselves are also smaller and more easily identifiable, making everything that much simpler to manage and, in theory at least, making issues that much easier to identify and correct.
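As a purely illustrative aside, the channel split described above can be sketched in a few lines of code. The snippet below is my own guess at the idea only – the channel names come from the post, but the hash-based assignment, the channel_for function and the exact mechanics are assumptions, not Linden Lab’s actual method:

```python
import hashlib

# Channel split as described above: three RC channels at roughly 10% of
# the grid each, with the remaining ~70% on the main Release channel.
# (The assignment method below is a guess, for illustration only.)
RC_CHANNELS = ["Blue Steel", "Magnum", "Le Tigre"]
RC_SHARE = 0.10  # each RC channel hosts ~10% of simulators

def channel_for(region_name: str) -> str:
    """Map a region to a channel deterministically (hypothetical scheme).

    Hash the region name to a stable value in [0, 1) and carve out a
    10% band per RC channel, leaving everything else on Release.
    """
    digest = hashlib.md5(region_name.encode("utf-8")).hexdigest()
    slot = (int(digest, 16) % 10_000) / 10_000
    for i, channel in enumerate(RC_CHANNELS):
        if i * RC_SHARE <= slot < (i + 1) * RC_SHARE:
            return channel
    return "Release"

# A given region always lands on the same channel between roll-outs,
# which is what makes the "known sets of simulators" predictable.
```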

Broadly speaking, this is how it works:

  • An update (be it bug fixes or whatever) is readied for release as a “beta”. If it is related to a specific project, it may be targeted at a specific RC channel (Blue Steel, Magnum or Le Tigre)
  • On the Wednesday of each week, the Release Candidates for each channel are rolled out to their respective 10% of the Grid; if a specific channel doesn’t have a candidate waiting, then obviously nothing is rolled out for it
  • Over the course of the next week, the candidate’s performance and impact on the Main Grid is monitored (and the channel servers may be subjected to numerous restarts). If the candidate proves particularly problematic, it may even be rolled back
  • If the candidate appears to be stable after 6 days, then it is (together with any candidates from the other two channels) rolled out to the entire Main Grid the following Tuesday
  • The cycle then repeats with the next RC in the channel dropping into its assigned servers on Wednesday.

If a specific RC causes problems, then the cycle for a specific channel may be broken for a week while the issue is worked on (for example, if a candidate on Le Tigre proves not to be ready for release as scheduled on a Wednesday, it will be “held over” for a week and made ready for release the following Wednesday).
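For those who like to see a schedule written down as logic, here is a minimal sketch of that weekly cycle in Python. Again, this is only my own illustration of the process described above – the Channel class, the function names and the stable flag are invented for the example, and the real deployment tooling is obviously nothing this simple:

```python
from dataclasses import dataclass

@dataclass
class Channel:
    name: str
    candidate: str | None = None  # RC package currently on this channel's sims
    stable: bool = True           # how the week of monitoring went

def wednesday_rollout(channel: Channel, queue: list[str]) -> None:
    """Wednesday: if the channel is free and a package is queued, roll it
    out (with restarts) to the channel's ~10% of the grid; otherwise
    nothing happens for that channel this week."""
    if channel.candidate is None and queue:
        channel.candidate = queue.pop(0)
        print(f"{channel.name}: rolling out {channel.candidate} to its 10%")

def tuesday_promotion(channel: Channel) -> str | None:
    """The following Tuesday: promote a stable candidate to the whole
    Main Grid, or hold a problematic one over for another week."""
    if channel.candidate is None:
        return None
    if channel.stable:
        promoted, channel.candidate = channel.candidate, None
        print(f"{channel.name}: promoting {promoted} grid-wide")
        return promoted
    print(f"{channel.name}: holding {channel.candidate} over for a week")
    channel.stable = True  # assume the fixes land before the next Tuesday
    return None
```

The hold-over case in tuesday_promotion is the “cycle broken for a week” situation: the candidate simply stays on its channel for another round of monitoring instead of being promoted.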

There is one other channel worth mentioning that doesn’t get a lot of publicity: the “Snack” channel, which handles releases related to (among other things) Mono-2 updates and various script monitoring tools. These releases are known to behave unpredictably, and so are initially rolled out to a very limited number of sims for testing. I understand that, once tested, the fixes then go on to wider testing (usually via Magnum) prior to a full roll-out.

The benefits of this system are obvious: if there is a major problem with a Release Candidate, it will only affect 10% of the Main Grid (rather than the 20% of the old system), and the releases themselves are less complex, making it easier for the root cause of specific problems to be identified and corrected. Overall, the process means there is less widespread upheaval across the Main Grid than tended to be the case with the old, larger-scale releases. There are many examples of these advantages: when a recent change impacted breedable horses, for example, it affected only a small percentage of horses on the grid (those present on servers running the specific release channel software).

Of course, there are what appear to be downsides to the new system: the release channels (particularly, it would seem, Le Tigre) can be in a state of flux when problems do occur, and the weekly roll-outs on both Tuesday and Wednesday, with their need for sim restarts, have been the topic of many a complaint. A minor irritant is the pop-up that appears when moving between sims running different server releases, be they a release channel or the “full” release – it would be nice if this could be turned off by those who have no interest in what software is being run on a given simulator, just as other pop-ups can be disabled through the Viewer.

But, these grumbles aside, it has to be said the new system works. While it does cause a degree of pain for those “stuck” on simulators running one of the release channels, the vast majority of the grid has seen far less upset and upheaval when things have gone wrong. Certainly, the “tidal effect” of grid-wide rollouts/rollbacks has become largely a thing of the past, and while the rolling restarts associated with Tuesdays and Wednesdays might seem inconvenient when they begin, the truth is they’re probably less so than they were under the old system.

4 thoughts on “Of channels and restarts”

  1. I for one can certainly live with a few sim restarts – I still remember the days when they closed the grid down for maintenance every Wednesday, and compared to that, all present-day “upheavals” are just a minor annoyance. The way they’ve been rolling out updates has vastly improved since then, and the new “live” system is definitely one of the things Linden Lab has done right in the last couple of years.

    1. I wonder how many do remember the 2001-esque “bash on things” splash screen that would be all one could see during the 6-8 hours the Grid was down every Wednesday – right through the afternoon / evening for those in Europe.

      Things have improved since then – although it’s easy to forget the positives over the years and focus on the negatives when it comes to SL. I’m obviously biased where the current channel system is concerned, as I’m on a simulator that runs the “release” software, so I tend to only experience one or two restarts on a Tuesday; I know others on the RC channels are not so fortunate. But, that said, rolling restarts that take down sims for between 5 and 30 minutes are massively better than the old days, or even the more recent periods when a rollback might also have been accompanied by a suspension of log-ins, etc.

      It’s going to be interesting to see what happens with the introduction of the new infrastructure management system…

  2. Of course the new system is much better than the earlier ones 😛 The worst-case scenario is that only 10% of the grid is down, and even if that is naturally bothersome for those affected, it means that 90% of all residents will never have a problem logging in… which is very, very good.

    For me, the most important change in this new model (which is quite revolutionary in terms of software deployment, IMHO!) is that Linden Lab can actually test their server software in a “real” environment. That was always the problem with the tests done on the Preview/Beta Grid: it was never sufficiently complex (in terms of objects, avatars and, most importantly, the interaction between those) to allow proper testing. Virtual worlds, by design, escape all standard procedures of software testing (there are even a few academic papers describing this!) and require novel approaches to experimentation: the world is so insanely complex that testing anything with just a tiny fraction of its complexity never turns out to be a valid test. But testing with 10% of the real thing apparently is enough.

    1. Assuming a fault occurs with a single channel, it’s restricted to just 10% of the grid ;-).

      But seriously – that is the benefit of the new system. And your point about the Beta Grid is well-made, and something I did gloss over here. It’s also something a lot of SL users possibly don’t appreciate (certainly given the number of times I’ve had to point it out in in-world conversations!): the Beta Grid is far, far less complex than the Main Grid, for the reasons you cite, ergo the results obtained through a period of Beta testing could be somewhat misleading and completely miss potential issues, which would then not become apparent until a release hit the Main Grid.

      The kind of process used by LL isn’t actually that uncommon in other software environments (or so I’ve been informed). A very good friend who is heavily involved in change management for a major investment bank informs me that this is precisely how the Deployment Team within his Operations Group handle deployments for their trading systems: everything is initially rolled out in a controlled “change batch” to impact only around 10% of their user base in each trading location, “bedded in” with them for a period of time, and then rolled out to the trade floor as a whole.

      It seems a perfectly common-sense approach, and as Heloise notes, it has significantly improved matters for all of us overall.
