Of channels and restarts

Once upon a time server roll-outs for Second Life were handled in what, on the surface, would seem a fairly straightforward manner:

  • New code would be tested on the Beta Grid, with users reporting any bugs or issues to LL for fixing
  • When considered relatively stable, the code would be rolled out on a limited basis to the Main Grid (affecting around 20% of the grid in total) for further “testing”; if major problems were found, the limited roll-out (or “pilot”) would be rolled back
  • If considered stable, the code would be rolled out to the remaining 80% of the grid, generally around 24 hours after the pilot.

The system wasn’t flawless; the complexity of the server code meant that many small (and one would guess conflicting) updates would be “rolled-up” into a single release, often with unpredictable results, despite testing on the Beta Grid. This would result in what I call the “tidal effect”: a change would be rolled out as a pilot, then rolled back for fixing, then rolled out before being rolled back for further fixing, and then rolled out once more, then rolled out again to the entire Main Grid. Sometimes even then, it would go through one more rollback / rollout.

As we’re all only too aware, this approach meant fairly large and constant upheavals for just about everyone concerned, and was the cause of much gnashing of teeth and dark mutterings towards Linden Lab.

To try and minimise the overall impact of server code updates and roll-outs, Linden Lab switched over to a “channel” system. Under this system, server code is operated across four channels: the Release Channel, with the latest “release version” of the server code (and supposedly the most stable), and three “Release Candidate” channels, code-named Blue Steel, Magnum and Le Tigre.

Each of the RC channels comprises about 10% of the total Main Grid, and is used to roll out a “beta” of a specific server code package. This might be a series of bug fixes (e.g. specific SVC JIRA fixes), it might be a general maintenance release (e.g. security updates), or it may be related to a specific, on-going project (such as Display Names, the “Fast Assets” project, the “Inventory Capabilities” project, and so on). Broadly speaking, specific projects tend to be rolled out through specific channels (the Inventory Capabilities project tends to roll out via Blue Steel, for example, as do changes related to the forthcoming arrival of Mesh) – although this is not a hard and fast rule. General maintenance releases, on the other hand, are distributed between all three channels, depending on which has the capacity at the time a release package is ready for beta testing.

So, at any one point in time, some 30% of the grid is hosting what is effectively “beta” software, but in very discrete “chunks”, so to speak, confined to known sets of simulators. The releases themselves are also smaller and more readily identifiable, making everything that much easier to manage and, in theory at least, making issues that much easier to pinpoint and correct.
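As a rough illustration of how the grid divides up under this system, here is a minimal sketch. The 10% figures are the approximate shares quoted above; the function and variable names are purely my own and are not anything Linden Lab uses.

```python
# Approximate share of the Main Grid (in percent) on each RC channel,
# using the "about 10%" figure quoted in the text. Illustrative only.
RC_CHANNEL_SHARE = {"Blue Steel": 10, "Magnum": 10, "Le Tigre": 10}


def grid_split(rc_shares):
    """Return (percent running RC code, percent on the main Release Channel)."""
    rc_total = sum(rc_shares.values())
    return rc_total, 100 - rc_total


def exposure(channel, rc_shares):
    """Percent of the grid affected if a single channel's candidate misbehaves."""
    return rc_shares[channel]


rc_total, release_total = grid_split(RC_CHANNEL_SHARE)
print(f"RC channels: ~{rc_total}% of the grid; Release Channel: ~{release_total}%")
print(f"A bad Magnum candidate touches ~{exposure('Magnum', RC_CHANNEL_SHARE)}% of the grid")
```

The point of the sketch is simply that a problem in any one candidate is confined to that channel's slice, rather than to the combined 30% running beta code.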

Broadly speaking, this is how it works:

  • An update (be it bug fixes or whatever) is readied for release as a “beta”. If it is related to a specific project it may be targeted at a specific RC channel (Blue Steel, Magnum or Le Tigre)
  • On the Wednesday of each week, the Release Candidates for each channel are rolled out to their respective 10% of the Grid; if a specific channel doesn’t have a candidate waiting then, obviously, nothing is rolled out
  • Over the course of the next week, the candidate’s performance and impact on the Main Grid is monitored (and the channel servers may be subjected to numerous restarts). If the candidate proves particularly problematic, it may even be rolled back
  • If the candidate appears to be stable after 6 days, then it is (together with any candidates from the other two channels) rolled out to the entire Main Grid the following Tuesday
  • The cycle then repeats with the next RC in the channel dropping into its assigned servers on Wednesday.
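The timing of the cycle above can be sketched in a few lines of Python. This is only my own model of the schedule as described here (Wednesday RC roll-out, promotion to the whole grid the following Tuesday, next RC slot a week later); the function names are hypothetical and the real process is obviously subject to Linden Lab's judgement on the day.

```python
from datetime import date, timedelta

WEDNESDAY = 2  # date.weekday() counts Monday as 0


def promotion_date(rc_rollout: date) -> date:
    """Tuesday on which a stable candidate would go to the entire Main Grid."""
    if rc_rollout.weekday() != WEDNESDAY:
        raise ValueError("RC roll-outs are assumed to happen on a Wednesday")
    return rc_rollout + timedelta(days=6)


def next_rc_slot(rc_rollout: date) -> date:
    """Wednesday on which the channel's next candidate would drop in."""
    if rc_rollout.weekday() != WEDNESDAY:
        raise ValueError("RC roll-outs are assumed to happen on a Wednesday")
    return rc_rollout + timedelta(days=7)


rollout = date(2011, 6, 1)  # an arbitrary Wednesday, chosen for illustration
print("RC roll-out:      ", rollout.strftime("%A %d %B %Y"))
print("Grid-wide release:", promotion_date(rollout).strftime("%A %d %B %Y"))
print("Next RC slot:     ", next_rc_slot(rollout).strftime("%A %d %B %Y"))
```

The six-day gap between the two roll-outs is the monitoring window mentioned in the list; the next candidate's slot falls the day after the grid-wide release.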

If a specific RC causes problems, then the cycle for a specific channel may be broken for a week while the issue is worked on (for example, if a candidate on Le Tigre, say, proves that it is not ready for release as scheduled on a Wednesday, it will be “held over” for a week and made ready for release the next Wednesday).

There is one other channel worth mentioning that doesn’t get a lot of publicity: the “Snack” channel, which handles releases related to (among other things) Mono-2 updates and various script monitoring tools. These are known to behave unpredictably, and so are initially rolled out to a very limited number of sims for testing. I understand that once tested, the fixes then go on for wider testing via (usually) Magnum prior to a full rollout.

The benefits of this system are obvious: if there is a major problem with a Release Candidate, it will only affect 10% of the Main Grid (rather than 20% with the old system); the releases are less complex, making it easier for the root cause of specific problems to be identified and corrected. Overall, the process means that there is less widespread upheaval across the Main Grid than tended to be the case with the old, larger-scale releases. There are many examples of these advantages; when a recent change impacted breedable horses, for example, it only affected a small percentage of horses on the grid (only those present on servers running the specific release channel software).

Of course, there are what appear to be downsides to the new system: the release channels (particularly, it would seem, Le Tigre) can be in a state of flux when problems do occur; and the weekly rollouts, with their need for sim restarts on both Tuesday and Wednesday, have been the topic of many a complaint. A minor irritant is the pop-up that comes up when moving between sims running different server releases, be they a release channel or the “full” release – it would be nice if these could be turned off by those who have no interest in what software is being run on a given simulator, just as other pop-ups can be user-disabled through the Viewer.

But, these grumbles aside, it has to be said the new system works. While it does cause a degree of pain for those “stuck” on simulators running one of the release channels, the vast majority of the grid has seen far less upset and upheaval when things have gone wrong. Certainly, the “tidal effect” of grid-wide rollouts/rollbacks has become largely a thing of the past, and while the rolling restarts associated with Tuesdays and Wednesdays might seem inconvenient when they begin, the truth is they’re probably less so than they were under the old system.