Turning it up to 11!
Time for our very first Hutch Tech blog post from our Head of Server Engineering, Geoff! Autoscaling can be a challenge when it comes to designing live services. So here’s Geoff to talk about the Hutch approach (and look out for more tech blogs over the coming weeks!)…
Hi, I’m Geoff Pate, and I’m Head of Server Engineering at Hutch. I’ve worked at Hutch for the last five years; before that, I worked on AAA console games, 3D poker games on PC, and other free-to-play mobile games. Hutch currently operates three very successful games - Top Drives, F1 Manager and Rebel Racing - all running on our bespoke cloud-native server stack, designed specifically to power live-ops tailored experiences. Today I’m going to talk about one particular challenge in designing our live services - autoscaling.
Introduction to autoscaling
One of the major benefits of running in the cloud is the ability to take advantage of its elasticity. Gone are the days of having to pre-calculate exactly how much capacity you think you need, ordering hardware and installing it in data centres - now a simple click of a button and more processing power is available instantly! And conversely, as demand for a product wanes, there’s no longer any need to decommission hardware and find it a new home - another button click and your compute resources are returned to the pool, ready for someone else to utilise them.
Taken to the extreme - and because cloud providers bill for compute at a very fine granularity (down to the second) - constantly monitoring our applications and bringing compute resources online or offline as demand increases or decreases allows us to pay only for the resources we actually need. This process is known as autoscaling.
Vertical vs Horizontal scaling
A very common approach to autoscaling relies on constantly monitoring the CPU and memory usage of your application. These are generally representative of how busy the application is - how many requests it is currently dealing with. As usage of these resources climbs towards a defined limit (e.g. the number of cores available to the application), the performance of the application tends to degrade. At this point, we have to decide how to deal with the limit - we can either scale the application vertically or we can scale it horizontally.
Vertical scaling involves making more resources available to the application. In the old days of running your own data centre, this meant buying a newer (and usually much more expensive) piece of server hardware that came with more CPUs or memory. This was a slow process, and it was also hindered by the limitations of hardware - eventually you wouldn’t be able to fit more resources into a single blade. While the cloud makes it faster to scale vertically (the providers usually have a range of hardware sizes available and ready to go), the same ceiling on resources exists, and it’s generally accepted now that horizontal scaling is a more sustainable solution.
Horizontal scaling involves bringing more instances of the same type of application online and splitting traffic between them using a load balancing mechanism. For this to work, the application itself must store as little state as possible (ideally none), so it really doesn’t matter which instance a request hits. This is how we write all services at Hutch.
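To make that concrete, here’s a minimal sketch of what a stateless handler can look like, with all player state living in a shared external store rather than in the process itself. The store and handler names are invented for illustration (and the in-memory dictionary stands in for something like Redis or a database) - this isn’t our actual service code.

```python
# Minimal sketch of a stateless request handler.
# ExternalStore and handle_add_score are hypothetical names; the point is
# that no player state lives in the process, so a load balancer can route
# any request to any instance.

class ExternalStore:
    """Stand-in for a shared store (e.g. Redis or a database)."""
    def __init__(self):
        self._data = {}

    def get(self, key: str) -> dict:
        return self._data.get(key, {})

    def set(self, key: str, value: dict) -> None:
        self._data[key] = value


def handle_add_score(store: ExternalStore, player_id: str, points: int) -> dict:
    # All state is fetched per request and written back immediately,
    # so the instance handling the request holds nothing between calls.
    profile = store.get(player_id)
    profile["score"] = profile.get("score", 0) + points
    store.set(player_id, profile)
    return profile
```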
Horizontal scaling based on CPU usage enabled us to cope very cost-efficiently with two key traffic patterns:
The enormous demand from huge numbers of requests when new games and updates are released, which then tails off into a more normal traffic pattern as time passes.
Normal traffic around a 24-hour cycle, where customers play very little during night-time hours and activity builds gradually over the day to a peak in the leisure hours of the evening.
The Surge Problem
Despite our success in cost-effectively dealing with day-to-day traffic - and even with the increased traffic during launch windows, as the number of users online grew hugely over the course of a couple of days - our implementation was insufficient to deal with a third common traffic pattern we started observing: peak traffic at predetermined times, driven by special events tied to the live nature of our games (e.g. seasonal events such as Christmas and Halloween, or real-world racing calendar events such as F1 Grand Prix weekends).
For example, let’s look at the traffic pattern for a normal day without any special event taking place in Rebel Racing:
It follows a fairly typical sinusoidal wave, with the peak in the evening hours when people are at home after work and the trough in the early hours of the morning when people are asleep. Traffic increases and decreases gradually over the time period. This is a perfect fit for monitoring CPU usage. At Hutch, we try to keep our servers running at between 45% and 55% CPU usage, which means we aren’t wasting CPU, but the servers have enough headroom to deal with load increasing in the time it takes new servers to come online and provide extra capacity.
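As a rough illustration of that kind of target-band scaling, here’s a sketch of the decision in isolation. The 45-55% band matches what I described above, but the function, its inputs and the instance limits are simplified stand-ins rather than our production autoscaler:

```python
# Illustrative CPU-band scaling decision, not Hutch's production autoscaler.
# Given the current average CPU across instances, return how many instances
# we want, aiming to land back near the middle of the 45-55% target band.

TARGET_CPU = 0.50      # aim for ~50% utilisation
SCALE_UP_AT = 0.55     # add capacity above this
SCALE_DOWN_AT = 0.45   # remove capacity below this

def desired_instances(current_instances: int, avg_cpu: float,
                      min_instances: int = 2, max_instances: int = 100) -> int:
    if SCALE_DOWN_AT <= avg_cpu <= SCALE_UP_AT:
        return current_instances  # inside the band: do nothing
    # Proportional rule: total work is roughly instances * cpu, so size the
    # fleet to bring average utilisation back towards the target.
    desired = round(current_instances * avg_cpu / TARGET_CPU)
    return max(min_instances, min(max_instances, desired))


if __name__ == "__main__":
    print(desired_instances(10, 0.70))  # busy -> 14 instances
    print(desired_instances(10, 0.30))  # quiet -> 6 instances
```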
Now, let’s compare it to this traffic pattern for a day on Top Drives where an event is ending at a specific time:
The difference is stark. The way the events are designed makes it incredibly important to be online and playing in the closing moments to achieve the best possible finish and win the best prizes. This means we often see 3x, 5x or even 10x changes in traffic in the last 3 or 4 minutes of popular events! This caused a problem - our CPU monitoring algorithm couldn’t cope with this sudden surge in demand.
Possible Solutions
Even allowing for the fact that we run at around 50% CPU utilisation before the surge starts, there simply isn’t enough headroom to deal with 10 times the traffic, and there isn’t enough time to bring more servers online to cope. So we went back to the drawing board.
First, we tried a couple of simple code changes to reduce the window of time over which we monitored CPU, but this led to an algorithm that produced spikier responses - reacting too aggressively to normal traffic patterns and bringing up more servers than were needed at normal times. This was undesirable as it led to increased server costs - the very thing we were setting out to avoid!
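To see why the shorter window misbehaved, here’s a toy illustration (the samples below are made-up numbers, not real metrics): a single brief spike in CPU pushes a short rolling average straight through the scale-up threshold, while a longer window barely notices it.

```python
# Illustration of why a short CPU monitoring window makes scaling spiky.
# The samples are invented: roughly 50% load with one brief spike.

from collections import deque

def rolling_average(samples, window):
    buf, out = deque(maxlen=window), []
    for s in samples:
        buf.append(s)
        out.append(round(sum(buf) / len(buf), 3))
    return out

samples = [0.50, 0.51, 0.49, 0.70, 0.50, 0.48, 0.51, 0.49]

# Short window: the spike pushes the average above the 0.55 scale-up line,
# triggering scale-ups that normal traffic doesn't actually need.
print(rolling_average(samples, window=2))

# Longer window: the same spike barely moves the average off ~0.5.
print(rolling_average(samples, window=6))
```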
Next, we looked at real-world comparisons (e.g. the surge in demand for electricity at half time in a big sports final, as everyone boils their kettle), where the supplier knows when the surge is coming and ensures they have the capacity to deal with it. This seemed like a good model to try. We knew when our events were scheduled to end, so we added code to our autoscaler to bring more server instances online 30 minutes before we were expecting the surge - we called this ‘preemptive scaling’. The following night, at 9:30 everything looked great - the new servers came up quickly, and we thought we were ready to cope. Then we hit another problem! By bringing the new servers online before the surge, the ‘normal’ autoscaling process monitoring CPU usage realised the average load was below our minimum 45% threshold and decided it would be a good time to turn some servers off to reduce costs! By the time the surge came, we were no better off.
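In sketch form, that first attempt looked something like this - at a fixed lead time before a known event end, bump the desired instance count. The function name, the multiplier and the timestamps are hypothetical; only the 30-minute lead comes from what we actually did:

```python
# Hypothetical sketch of the first 'preemptive scaling' attempt: at a fixed
# offset before a known event end time, raise the instance count up-front.

from datetime import datetime, timedelta, timezone

PRESCALE_LEAD = timedelta(minutes=30)

def preemptive_target(now: datetime, event_end: datetime,
                      normal_instances: int, surge_multiplier: int) -> int:
    """Return the instance count wanted right now, given a scheduled surge."""
    if event_end - PRESCALE_LEAD <= now <= event_end:
        return normal_instances * surge_multiplier
    return normal_instances

if __name__ == "__main__":
    end = datetime(2020, 6, 1, 22, 0, tzinfo=timezone.utc)
    print(preemptive_target(end - timedelta(minutes=20), end, 10, 5))  # 50
    print(preemptive_target(end - timedelta(hours=2), end, 10, 5))     # 10
```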
Our solution - Preemptive scaling with a safety window
At this point, we felt we were close. We had a mechanism to bring extra capacity online at predetermined times; we just needed to prevent our normal scaling mechanisms from fighting with it. A little more code was introduced to our autoscaler so that whenever it encountered a scheduled preemptive scaling request, it would start a safety window during which it would no longer scale down. This worked perfectly. That night the servers all came up in advance and stayed up, dealing with requests within our target success rate (> 99.9%) and with our 95th percentile timings averaging between 100 and 200ms.
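Putting it together, the core of the idea looks roughly like the sketch below. This is simplified rather than our real autoscaler code - the class names and the decision function are invented - but it shows the two rules working together: take the larger of the CPU-based and preemptive targets, and refuse to scale down while a safety window is open.

```python
# Hypothetical sketch of preemptive scaling with a safety window. After a
# scheduled preemptive scale-up, the normal CPU-based rule may still scale
# UP, but scale-DOWN decisions are suppressed until the window expires.

from datetime import datetime, timedelta, timezone

class SafetyWindow:
    def __init__(self):
        self._until = None

    def start(self, now: datetime, duration: timedelta) -> None:
        self._until = now + duration

    def scale_down_blocked(self, now: datetime) -> bool:
        return self._until is not None and now < self._until


def apply_scaling(current: int, cpu_desired: int, preemptive_desired: int,
                  window: SafetyWindow, now: datetime) -> int:
    # Always honour the larger of the CPU-based and preemptive targets...
    desired = max(cpu_desired, preemptive_desired)
    # ...but never shrink the fleet while the safety window is open.
    if desired < current and window.scale_down_blocked(now):
        return current
    return desired


if __name__ == "__main__":
    now = datetime(2020, 6, 1, 21, 30, tzinfo=timezone.utc)
    window = SafetyWindow()
    window.start(now, timedelta(minutes=45))
    # The CPU-based rule wants to drop from 50 to 30 instances before the
    # surge arrives, but the safety window keeps the fleet at 50.
    print(apply_scaling(50, 30, 0, window, now + timedelta(minutes=10)))  # 50
```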
Until now, we had been hard-coding times into the autoscaler as we fought to handle these surges. The final piece of the puzzle was to automate this process. We added more controls for our live operations teams so that, when they were authoring events, they could specify the expected size. Whenever events were created, they would automatically register a preemptive scaling request with the autoscaler ahead of the expected surge, scaling the service appropriately for the configured size.
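Conceptually, the wiring looks something like the sketch below. The event sizes, the capacity they map to, and the 45-minute safety window are made-up values for illustration - the point is just that authoring an event automatically registers a preemptive scaling request:

```python
# Hypothetical sketch of wiring event creation to the autoscaler: when the
# live-ops team authors an event with an expected size, a preemptive scaling
# request is registered automatically. Classes, fields and values are
# invented for illustration.

from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class ScalingRequest:
    scale_at: datetime          # when to bring extra capacity online
    extra_instances: int        # how much extra capacity to add
    safety_window: timedelta    # how long to block scale-downs afterwards

@dataclass
class Autoscaler:
    requests: list = field(default_factory=list)

    def register(self, request: ScalingRequest) -> None:
        self.requests.append(request)

# Expected event "size" chosen by live ops, mapped to extra capacity.
SIZE_TO_EXTRA_INSTANCES = {"small": 5, "medium": 15, "large": 40}

def create_event(autoscaler: Autoscaler, ends_at: datetime, size: str) -> None:
    # Registering the event also registers its preemptive scaling request,
    # scheduled 30 minutes ahead of the expected surge.
    autoscaler.register(ScalingRequest(
        scale_at=ends_at - timedelta(minutes=30),
        extra_instances=SIZE_TO_EXTRA_INSTANCES[size],
        safety_window=timedelta(minutes=45),
    ))
```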
This model has worked very well for us so far, and allows us to run different sizes of events that end at different times, knowing that the server capacity will always be in place to handle any surge, while at the same time keeping our costs to a minimum.
If you're a Server Engineer or a Unity Games Engineer looking to start a new chapter in your career, check out our current vacancies here: hutch.io/careers. We'd love to hear from you!