On Friday, November 6th, 2020, Lab Gab, the live-streamed chat show hosted by Strawberry Linden on all things Second Life, returned to the subject of the work to transition all Second Life services to Amazon Web Services (AWS) and away from running on the Lab’s proprietary hardware and infrastructure.
The session came some 7 months after the last Lab Gab to focus on this work in April 2020 with Oz Linden and April Linden (see Lab Gab 20 summary: Second Life cloud uplift & more), and this time, Oz Linden sat in the hot seat alongside Mazidox Linden.
The official video of the segment is available via YouTube, and is embedded at the end of this article. The following is a summary of the key topics discussed and responses to questions asked.
Mazidox Linden is a relative newcomer to the Linden Lab team, having joined the company in 2017 – although like many Lab staff, he’s been a Second Life resident for considerably longer, having first signed-up in 2005.
He is the lead QA engineer for everything simulator related, which means his work not only encompasses the simulator and simhost code itself, but also touches on almost all of the back-end services the simulator software communicates with. For the last year he has been specifically focused on QA work related to transitioning the simulator code to AWS services. He took his name from the Mazidox pesticide and combined it with the idea of a bug spray to create his avatar, to visualise the idea of QA work being about finding and removing bugs.
Oz Linden joined the company in 2010 specifically to take on the role of managing the open-source aspects of the Second Life viewer and managing the relationship with third-party viewers, a role that fully engaged him during the first two years of his time at the Lab. His role then started expanding to encompass more and more of the engineering side of Second Life, leading to his current senior position within the company.
What is the “Cloud Uplift”?
- Cloud Uplift is the term Linden Lab use for transitioning all of Second Life’s server-based operations and services from their own proprietary systems and services housed within a single co-location data centre to commercial cloud services.
- The work involves not only the visible aspects of SL – the simulators and web pages, etc., but also all the many back-end services operated as a part of the overall Second Life product, not all of which may be known to users.
- The process of moving individual services to the cloud is called “lift and shift”: taking each element of software, making the required adjustments so it can run within a cloud computing environment, then relocating it to AWS infrastructure and hardware in a manner that allows it to keep running exactly as it did prior to the transfer, while avoiding disruptions that may impact users.
- The current plan is to have all of the transitional work completed before the end of 2020.
- However, this does not mean all of the work related to operating SL in the cloud will have been completed: there will be further work on things like optimising how the various services run on AWS, etc.
Why is it Important?
- It allows Second Life to run on hardware that is a lot more recent than the servers the Lab operates, and allows the Lab to evolve SL to run on newer and newer hardware as it becomes available a lot faster than is currently the case.
- In particular, up until now, the route to upgrading hardware has involved the Lab reviewing, testing and selecting hardware options, then making a large capital expenditure to procure the hardware, implement it, test it, then port their services over to the hardware and test, then implement – all of which could take up to 18 months to achieve.
- By leveraging AWS services, all of the initial heavy lifting of reviewing, testing, selecting and implementing new server types is managed entirely by Amazon, leaving the Lab with just the software testing / implementation work.
- A further benefit is that when SL was built, the capabilities to manage large-scale distributed systems didn’t exist, so LL had to create their own. Today, such tools and services are a core part of product offerings like AWS, allowing the Lab to leverage them and move away from having to run (and manage / update) dedicated software.
- Two practical benefits of the move are:
- Regions running on AWS can run more scripts / script events in the same amount of time than can be achieved on non-AWS regions.
- The way in which simulators are now managed means that LL can more directly obtain logs for a specific region, filter logs by criteria to find information, etc., and the entire process is far less manually intensive.
How Secure is SL User Data on AWS?
- It has always been LL’s policy when dealing with third-party vendors (which is what AWS is) not to expose SL user data to those vendors, beyond what is absolutely necessary for the Lab to make use of the vendor’s service(s).
- This means that while SL user data is stored on AWS machines, it is not stored in a manner Amazon could read, and is further safeguarded by strict contractual requirements that deny a company like Amazon the right to use any of the information, even if they were to be able to read it.
- In fact, in most cases, user-sensitive data is effectively “hidden” from Amazon.
- LL is, and always has been, very sensitive to the need to protect user data, even from internal prying.
- In terms of the simulators, a core part of testing by Mazidox’s team is to ensure that where user data is being handled (e.g. account / payment information, etc.), it cannot even be reached internally by the Lab, and certainly not through things like scripted enquiries, malicious intent or prying on the part of third-party vendors.
- [54:30-55:18] Taken as a whole, SL on AWS will be more secure, as Amazon provide additional protections against hacking, and these have been combined with significant changes LL have made to their services in the interest of security.
Why is Uplift Taking So Long?
- The biggest challenge has been continuing to offer SL as a 24/7 service to users without taking it down, or at least with minimal impact on users.
- This generally requires a lot of internal testing beforehand to reach a point of confidence to transition a service, then make the transition and then step back and wait to see if anything goes dramatically wrong, or users perceive a degraded service, etc.
- An example of this is that extensive study, testing, etc., allowed LL to switch over inventory management from their own systems to being provisioned via AWS relatively early on in the process, and with no announcement it had been done – and users never noticed the difference.
- Another major challenge has been to investigate the AWS service offerings and determine how they might best be leveraged by SL services.
- As many of the SL services overlap one another (e.g. simulators utilise the inventory service, the group services, the IM services, etc.), a further element has been determining a methodical manner in which services can be transitioned without impacting users or interrupting dependencies on them that may exist elsewhere.
- The technology underpinning Second Life is a lot more advanced and recent within the AWS environment, and this means LL have had to change how they go about certain aspects of managing SL. This has in turn required experimentation, perhaps the deployment of new tools and / or the update / replacement of code, etc.
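As an illustration of the dependency problem described above, here is a minimal Python sketch of one way to derive an order in which services could be moved so that each service follows everything it relies on. This is not the Lab’s actual process, which was not detailed in the session; the service names and the dependency map are hypothetical, loosely based on the simulator example given above.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map (an assumption for illustration only):
# each service is mapped to the set of services it calls.
DEPENDS_ON = {
    "simulator": {"inventory", "group_services", "im_services"},
    "inventory": set(),
    "group_services": set(),
    "im_services": set(),
}


def migration_order(depends_on):
    """Return services ordered so each one comes after its dependencies."""
    # TopologicalSorter treats the mapped values as predecessors, so
    # static_order() yields dependencies before the services using them.
    return list(TopologicalSorter(depends_on).static_order())


order = migration_order(DEPENDS_ON)
print(order)
```

With this map, the three back-end services come out ahead of the simulator, which is consistent with inventory having been moved relatively early. A real plan would also have to account for services that can only be moved in stages.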
Will Running on AWS Lower Operating Costs?
- During the transitional period it has been “significantly” more expensive to operate SL, inasmuch as LL is paying to continue to operate its proprietary systems and services within their co-lo facility and pay for running services via AWS.
- Even after the need to continue paying for operating the co-lo facility has ended, it is unlikely that the shift to AWS will start to immediately reduce costs.
- However, the belief is that moving to AWS will, in the longer term, reduce operating costs.
- Whether reduced operating costs lead to reduced costs to users, or whether the savings will be re-invested in making further improvements to the service lay outside of this discussion.
- Right now the focus is not on driving down costs or making the service significantly better, but is solely on the work of getting everything transitioned. Lowering costs and making more efficient use of the underpinning capabilities provided by AWS will come after the migration work has been completed.
What Happens to the Old Hardware / Facility, Post-Uplift?
- Several years ago, LL consolidated all of their hardware and infrastructure into a single co-location data centre in Arizona.
- Most of the hardware in that facility is now so old it has depreciated in value to a point where it is pretty much worthless.
- A specialist company has therefore been contracted to clear-out the Lab’s cage(s) at the co-lo facility and dispose of the hardware.
- As a demonstration of LL’s drive to protect user data, all drives on the servers will be removed under inspection and physically destroyed via grinding them up on-site.
What Has Been Uplifted So Far and What Have Been the Issues / Lessons Learned?
- Uplifted services include (note, this list includes services mentioned at other Lab meetings):
- The majority of all web properties (e.g. log-in, Destination Guide, Place Pages, “old” Linden Homes web control panel pages; the Land Store; SL Marketplace, etc.), with the remaining properties expected to be transitioned possibly by the end of November 2020.
- A wide range of back-end services: Inventory; the avatar Bake Service; Group Chat service (which has resulted in group chat reliability issues), etc.
- Approximately (at the time of writing) one-third of all regions on the main grid (Agni), as well as much of the Beta grid (Aditi). This will be increased – see below.
- While there have been issues, these have been relatively few, and have tended to be tied to those services which, for whatever reason, have had to be transitioned to AWS in stages, rather than in a single move, or where services running on AWS have to communicate with services still running within the Lab’s co-lo facility as they await their turn to be transitioned.
- Another group of issues are down to the physical hardware / infrastructure environment within the co-lo that has allowed things to happen in certain ways, rather than being directly managed. This is something that cannot be replicated within the AWS environment, leading to problems and the need to make changes.
- Fixing bugs apparently arising from the cloud work has also led to net improvements to SL as a whole, such as with region crossings: a specific issue of avatar attachments being lost when crossing between regions in the co-lo and AWS led to a fix to a deep-set, long-standing bug that largely improved region crossings right across the grid.
- Unfortunately, due to the original bug, a lot of vehicles have been scripted to work around it – and these are now having issues as a result of the fix.
What Will be the In-World Improvements (“lag”, etc.)?
- The most direct answer is “We’ll see.”
- “Lag” is a complex issue, and much of it is not simulator-side (it is the result of things like viewer settings, the hardware the viewer is running on, network connectivity – including things like the number of hops between the viewer and the SL servers, the amount of traffic, etc.).
- Network connectivity is a particular factor, as the new location for all of the SL services is (initially, at least) in Oregon (NW USA), which can include further steps in reaching them. This may improve things for some users or make things slightly worse for others. However, the hope is these will not be very conspicuous changes where network times increase.
- This is liable to be true for those users not in the continental United States.
- For users outside of the United States, the longer-term possibility is that once SL is running smoothly via AWS and LL has had time to analyse things, regions could become more geographically dispersed to better serve their primary audience, which may help overcome some of the network-related latency.
- There is currently no time-table for when experiments like this might start, if it is something LL pursue.
- “Lag” is also somewhat subjective, in that expectations from new users differ from those of established users; land owners may have different expectations to vehicle users, etc.
- As such, changes / improvements, etc., are hard to predict. Some general performance improvements have already been noted – as with script processing, and there may be more. Other aspects may remain unchanged.
- The hope is that overall, and allowing for individual specifics, most people will find things to be at least as good as, if not better than, their current in-world experience.
- One of the things the migration is allowing is the addition of more logging / monitoring capabilities so that the Lab can better ascertain how well (or otherwise) simulators and services are running. The Lab is liable to spend “quite a lot of time” in 2021 doing just that.
How are Regions Selected for Uplift? Can Requests be Made?
- Mazidox has essentially been the Tzar of Uplift, determining which regions should be migrated.
- Estate / region holders can file a support ticket if they have a region / regions they wish to have migrated to AWS.
- Outside of the initial Blake Sea migration (which started by cloning those regions to Aditi and then testing them on AWS), regions are not selected in terms of their in-world geographic location (e.g. a Mainland Infohub and its surrounding regions).
- Rather, initial uplift work was focused on transitioning regions that could be used to solve a specific issue (e.g. Blake Sea and region crossings) or to investigate a specific question (e.g. how do AWS servers manage high-volume / high-memory regions, is it possible to share CPU cycles between simulator instances, etc.).
- More broadly, regions for migration have been placed within a cohort similar to the Release Candidate channels used for simulator deployments, and then transitioned.
- This past week has seen the three primary RC channels – BlueSteel, LeTigre and Magnum – transitioned to AWS (on Wednesday, November 4th and Thursday, November 5th).
- Tuesday, November 10th is liable to see a portion of the simulators assigned to the SLS channel proper transitioned to AWS, meaning that a majority of the main grid will be hosted on AWS.
With AWS, Can the Viewer’s Bandwidth be Set to More than 1500 Kbps?
- Short answer: “we really don’t know”.
- The Bandwidth slider was actually implemented to deal with an issue with sending UDP traffic (UDP at the time being SL’s primary means of data exchange between the viewer and the simulator) and potential network congestion that could result.
- Today, the majority of SL data traffic is handled via HTTP, which handles congestion automatically – and the bandwidth slider has absolutely nothing to do with this.
- A lot of this HTTP traffic has also been moved away from running through the simulators (it is routed via the Lab’s contracted Content Delivery Network(s) – CDNs).
- Thus, changing the slider is unlikely to have a genuine impact on performance, one way or the other – although this is caveated by the “we really don’t know” comment.
Are the New Region Offerings on AWS? Will they Slow the Transition Work?
- Currently, support tickets must be filed for new regions, but the Land Store will be re-opening soon.
- If a new region is not running on AWS when delivered, it will be in the very near future.
- No, provisioning new regions will not slow the work in transitioning the rest of SL to AWS.
Will AWS Allow for New Land Products?
- There is the potential for a range of new land products to be offered.
- AWS presents considerable flexibility in this regard, and there have been internal discussions.
- However, the answer at this point in time is, “Maybe. We’ll see.”
What’s Next on the Technical Side, Post-Uplift?
- Mostly clean-up following the transition, together with performance tuning.
- There is a backlog of ideas and feature requests, some of which will likely be tackled – but no details on what they might be at this time.
- [It should be noted that outside of the cloud migration, a lot of work continues to be put into the viewer, which will continue to see updates and new features and longer-term projects to improve things like performance.]