SL projects update week 40 / 2

Server Deploys

As many are aware, there was a major error in this week’s LeTigre Release Channel deploy. Apparently, the root cause of the problem lay in the server-side prim accounting code, which Simon Linden describes as having “blown up” on the LeTigre RC channel. Regions were consequently seen as “full”, and a large number of items (including partial builds) were returned to people’s inventories. The problem required a two-stage recovery:

  • LeTigre regions were rolled back to a state prior to the faulty deployment, and were then updated with the BlueSteel code also deployed on Wednesday October 3rd. This helped to determine the extent of the damage (a total of some 1200 regions)
  • The regions damaged by the land impact miscalculation were then restored to a state prior to the roll-out of the original faulty LeTigre code. These had to be restored manually, which took a considerable time

There is further post-mortem work going on to try to discover why this error did not reveal itself when the code deployed to LeTigre was being tested on Aditi, and whether there is anything specific to the regions impacted by the error which may have triggered it. Thought is now also being given to managing large-scale region restorations, even though this is the first time an issue of this magnitude has occurred across the grid.

Current RC plans for next week call for the same maintenance release to be made to all three RC channels, which Simon Linden describes as, “Mostly internal changes but [which] does include a minor update for the physics engine library … It’s almost all updating libraries … we’ve been using a fairly old set of compilers and such to make some of the development builds of the servers, and this brings us to more recent code.” Further details on the deploy should be available next week in the Second Life Server section of the Technology forum.

SL Viewer

As indicated in part one of this report earlier this week, problems have continued with the Beta viewer code and high crash rates. Work has been ongoing to try and locate the probable cause(s), some of which included the temporary return of tcmalloc. While not actually a cause of the crash issues, having tcmalloc disabled was affecting efforts to reproduce the problems. A beta release was made on the 3rd/4th October (3.4.1.265434), which is proving to be a lot more stable than previous versions, and which happens to have tcmalloc enabled.

The current plan is for a further beta release to be made, most likely on Monday 8th October, which should see tcmalloc turned off once more (if not removed). Should this also prove to be stable, the fixes it contains will be merged back into the development viewer code, which will clear the way for working through the backlog of code merges for both the beta and development viewers. It may also see a further 3.4.1 release version of the viewer being made.

Among the projects awaiting merging into the development and beta viewer code are:

  • The Steam support changes, which have been available within a development viewer stream, and which are described as “mostly cosmetic”. There is apparently a version of the viewer on Steam, but it is not available for general viewing / download, and is presumably there for testing purposes
  • Monty Linden’s HTTP library (texture fetch) code
  • Baker Linden’s Group Services project code
  • Apple OS X 10.8 Mountain Lion support work, including Gatekeeper compatibility
  • Bug fixes and further regionalisation work.

Previous plans for these releases called for them to be made under the 3.4.2 code base. While this wasn’t discussed at the TPV/Dev meeting, one assumes this is still the case. However, speaking at the TPV/Dev meeting on Friday October 5th, Oz Linden indicated that the order, etc., in which waiting merges will be cleared hasn’t been fully defined, and will be the subject of internal conversations next week at the Lab.

Avatar Baking Project

Bake fail: a familiar problem for many

There is still no major news on this project, although work is continuing both on the viewer and on the server code.

The plan remains to provide TPV developers with access to the viewer code at least 8 weeks ahead of any initial deployment of the server-side code to an Agni release channel. This is to allow TPVs time to merge the code into their viewers and participate in ongoing testing of the new service. There is a possibility that the viewer code will be available sufficiently far in advance for TPVs to be able to use it alongside the testing on Aditi (beta grid), depending on the status of the beta grid tests and how development of the viewer code progresses.

8 thoughts on “SL projects update week 40 / 2”

  1. Having commented on “Black Wednesday” on an earlier post, I won’t repeat myself. However, this news from Simon Linden puzzles and worries me.
    Yes, the “convex hull” breakage resulted in a lot of prims being returned on sims that were close to or at their theoretical limits. My experience, on a sim that is nowhere near its prim-allocation limit, is quite different. I and my partner both had multiprim objects “returned” (eventually) despite our sim’s allocation limit not being exceeded. All returned items were scripted, and many scripted objects appeared “broken” under the first code roll. This physics- or script-breakage does not seem to feature in Simon’s triage of the debacle. I have to hope that this was NOT connected with the “physics update”. If it was, I foresee another sorry situation next Wednesday which will require rollbacks if massive asset damage is not to ensue.
    I am beginning to wonder if The Lab really does understand fully what carnage they created on LeTigre last Wednesday.

    1. Sadly, I wasn’t at the Friday meeting (SLurl) in order to raise your comments (and I actually would have, had I been there). It’s held at midnight UK time, and I’m reliant on the largesse of a couple of friends to send me transcripts.

      If you can make the meeting next Tuesday, 8:00 UK time (same SLurl as above, meetings are held twice a week), you can raise concerns with Simon and Andrew, et al.

      There is also Oskar’s Beta server meeting (Aditi SLurl) at 23:00 UK on Thursdays (which is also an awkward one for me to make).

    2. As I understand it, there were two problems with the LeTigre update. One was the massive return of objects noted here — and, yeah, it very definitely affected regions and parcels that were nowhere near their limits. Whatever went wrong with the land impact accounting, it went very, very wrong.

      I had a bunch of content returned from relatively prim-sparse land, but most of it was not scripted. One _possible_ reason that scripted items might be more subject to being returned would be if more such items belonged to an account other than the landowner, compared to unscripted items. (This was suggested as one reason that group-owned land was particularly hard-hit.)

      The other problem was that the LSL function llGetPos() always returned in the updated sims. This may have been what caused the reported script breakage. I believe that the Lab is aware of that problem, too, but it wouldn’t hurt to confirm it.

      Regarding the post-mortem, I hope they not only consider what needs to change to make sure no deployment ever goes so wrong again, but also some tools for Operations to detect much sooner that something is going wrong. Clearly, during the roll out, the affected sims were doing very strange things; a massive drop in total number of objects would be a big red flag, if the sims were instrumented for such things. A valuable risk containment step would be to automatically monitor not just one specific metric like that, but any large deviation from normal conditions that coincides with Operations activity.

      1. (Hmmm. Apparently WordPress eats angle-bracketed expressions. I’ll try a different way of saying it: llGetPos() always returned ZERO_VECTOR in the updated sims.)

        1. WordPress is doing some bizarre things right now, both on displayed pages and within the inbuilt editor. Most of the issues I’ve encountered have occurred since my switch to a new layout template. This might be a placebo observation (and certainly doesn’t explain the problems I’m witnessing when using the inbuilt editor), but is certainly curious. Apologies for the problem you had.

  2. I’d love to go to one of these almost mythical Linden Office Hours that LL hold, but 8:00am BST is before my medications kick in and I am even more witless than usual. 12 midnight is off limits for much the same reason, except by then the meds are wearing off!
    If I sent Simon a notecard to his inworld account, do you think he would read it?

    1. Sorry, my fault: not 8:00am, but 20:00 (8:00pm). I broke my usual rule and used 12-hour notation instead of 24, which I use to avoid this kind of confusion. Clearly, not enough coffee in my bloodstream today…

  3. Once again I build to a new, if unannounced, “feature” only to discover that it is a bug! Lately we have been building walls and fences. I noticed that longer linksets were possible and took advantage of it.
    I will wait and see what happens 🙂
