Rabbits, they are small, the are cute, they are fluffy…. and they are terrible fighters*.
Which is why evolution has favoured the skittish amongst them.
The rabbit which was overly cautious and ran away and the first hint of danger survived and went on to produce more rabbits. But the rabbit which was more laid back and waited until it was sure it was in peril before trying to flee did not. Of course running away does have a cost, and if you spend all your time running at the slightest sound you’ll expend a lot of energy and have no time to nibble at the grass and gain more**.
So what do rabbits have to do with Auto Scaling Groups (ASGs)?
As the name implies ASGs are able to scale automatically. When the group detects some set condition it can react accordingly. Just like a rabbit hearing a fox sneaking through the nearby brush.
For the most part we’ve used the metric of CPU to indicate when we should scale up***, and we’ve been pretty conservative about it.
For example, one of our applications had the following policy:
- Scale up after 2 minutes of > 80% CPU, 1 at a time.
Two minutes of 80% CPU is the rabbit waiting until it’s between the foxes teeth before it tries to run away, and one extra instance at a time is more of a gentle amble away from danger, not exactly a sprint.
Of course hindsight is a wonderful thing, and the scaling policy in place seemed reasonable at the time. In fact even having a scaling policy is a far step better than a lot of the applications we have deployed.
So what should our rabb^H^H^H ASGs do?
Another application we have deployed is our Dynamic Images service. This service is responsible for serving and scaling the photos on our site in real time.
If this service doesn’t respond quickly enough our site won’t have any photos, and we really don’t want that. So the ASG policy looks something like this****:
- If the CPU is > 35% for 3 minutes, add an extra instance and wait 90 seconds before adding more.
- But if the CPU is > 48% for 1 minute, add 5 extra instances, and wait 1 minute before adding more.
- If the CPU is <20% for 15 minutes, take one instance away, then wait 90 seconds before taking more.
This is much more rabbit like behaviour:
- If you hear a noise, stop what you’re doing, perk up your ears, and pay more attention.
- If you see a fox, run. Run fast.
- And only when things have been quiet for a long time is it OK to very cautiously come back out.
So what’s the cost of this behaviour? About $0.093 per hour for each instance we spin up. For that price we can afford to be a little jumpy.
* Most rabbits, of course there is always an exception.
** Clearly the rabbit in the first photo doesn’t have a lot of danger in its life.
*** Whether or not CPU is the right metric to scale on is beyond the scope of this post.
**** Amazon recently released Scaling Policies with Steps to make this behaviour even easier.