Fewer AWS accounts please (aka the “goldilocks” strategy is back)

REA’s journey with Amazon Web Services (AWS) began in late 2010 when we started experimenting with using the cloud for our dev/test infrastructure. In 2013 we launched our first cloud-only production infrastructure to handle the dynamic resizing and serving of our images. Since then we have adopted an IT strategy of transitioning all systems to the cloud, and have therefore been running a hybrid cloud and data centre platform. More recently we have also embraced micro-services, which means the volume of systems that we run in the cloud has exploded. This blog covers how our usage of AWS accounts and VPCs has changed and what we propose to do next.

Wind back to 2013…

In 2013 we modelled our accounts closely on the environments used in software development at the time.

We used one account for dev/test where our Continuous Integration servers ran and per team test environments could be provisioned. This account was introduced around 2011 and through it we gained experience with deployment tooling and cost management challenges.

Additionally, we used one account for staging (or pre-production) which was networked with our staging environment in the data centre. And finally we had one account for production which, again, was networked with our production environment in the data centre.

Data centre connectivity was provided through Direct Connect and each AWS account typically featured one Virtual Private Cloud (VPC) with a number of subnets: in most cases a public, private, and services subnet per Availability Zone (AZ).

This worked out really well for us. We managed our own stopinator to scale instances up and down around working hours, engaged in right sizing to a degree, and took advantage of reserved instances. At this stage around 20 teams were using the cloud for dev/test, continuous delivery was taking off, and the success of our “images in the cloud” initiative had more teams interested in the cloud for staging and production. We were also embracing devops with “ops folk” embedded in nearly all teams and developers starting to take more responsibility for deployments and production ownership.

Our IT strategy around cloud adoption was announced internally and everything was great.

Until it wasn’t.

2014: operational bottlenecks, hard limits, bitcoin mining!

In early 2014 we started feeling constrained. The 100 S3 bucket hard limit per account that existed at the time was causing some grief. Teams wanted to move fast, embrace devops, and have more autonomy. Meanwhile, managing costs and cost reduction centrally was not working. And (embarrassingly) we had a few leaks of shared developer credentials which allowed some bitcoin mining to occur in the ever-expanding catalogue of regions available.

Thus the “Team Managed Infrastructure” project (quickly coined TMI) was spun up and in approximately 4 months we transitioned to a world where we had a procedure for creating a new AWS account (including rolling out standard CloudFormation templates for VPCs, roles, and users) as well as a federated identity approach to allow our internal LDAP system to power AWS account access. Users no longer relied on the long-lived credentials provided through the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables; instead we obtained temporary credentials from the AWS Security Token Service.
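To illustrate the difference between long-lived keys and temporary credentials, here is a minimal sketch that asks STS for short-lived credentials by assuming a role. It is not our federation tooling: the LDAP-backed login step is not shown, the function names are made up for this example, and the client is passed in as a parameter so the code can be exercised without AWS access. With boto3 the client would typically come from `boto3.client("sts")`.

```python
from datetime import datetime, timezone


def temporary_credentials(sts_client, role_arn, session_name, duration_seconds=3600):
    """Assume a role via STS and return short-lived credentials as a dict.

    sts_client is assumed to behave like boto3.client("sts").
    """
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        DurationSeconds=duration_seconds,
    )
    creds = response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
        "expires_at": creds["Expiration"],  # timezone-aware datetime in boto3
    }


def is_expired(credentials, now=None):
    """True once the temporary credentials have passed their expiry time."""
    now = now or datetime.now(timezone.utc)
    return now >= credentials["expires_at"]
```

The point of the design is that a leaked set of these credentials stops working once `expires_at` passes, whereas a leaked long-lived key works until someone notices and revokes it.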

Teams were free to start creating accounts and very loose suggestions were provided around separating production from dev/test environments.

By late 2016 the horse had well and truly bolted…

By this time we were managing over 100 AWS accounts for a mixture of dev/test, training, and production, used by just over 40 teams, with fairly consistent observable growth:

Growth of accounts

So far things are still looking up.

Teams are now fully responsible for supporting their systems during business and after hours and have a high degree of AWS expertise. Once the basic account setup has been completed by a central team (which takes less than a day) the new account owner is free to customise and set up the account and VPCs within that account to suit their needs. Hard limits are certainly no longer a problem. Each team is also responsible for routinely ensuring that all of their accounts are compliant with our cloud security checklist.

The downsides are hotly debated as they are not consistently viewed as downsides at all. These fall under:

  • which approach to use when connecting systems across accounts;
  • efficiency concerns.

To understand connectivity challenges: imagine one team has created a fabulous API “B” that is deployed in one VPC in one AWS account and a second team wishes to connect their API “A” from a separate VPC in a separate AWS account. With many teams and many micro-services this is an increasingly common desire.

tmi_a_calls_b

Options for connectivity include:

  • connect over the public internet
  • connect via Direct Connect
  • connect via VPC peering

Each time this type of problem cropped up we had a separate discussion regarding the pros and cons. Differing opinions were available from our infrastructure team and, as each team had configured their VPCs differently, the implications of opening up connectivity differed and had to be explored separately. It was not uncommon for an account to be peered to 5 other accounts. The Direct Connect option was also only available in Sydney for us.

Looking back through our internal wiki I can see this topic coming up again and again. Each documented reference likely was the result of a great deal of offline team discussion and confusion. Don’t get me wrong, some conversations are highly valuable and should be had over and over as approaches and teams change (e.g. those around architectural principles) but if the conversational tone is trending towards venting and frustration there is a problem.

The overall level of inherent networking complexity was increasing and it felt like the barriers between the services did not exist for good architectural reasons – some of us were asking whether A and B could just be within the same VPC and the same account.

To understand the current state we took this example of a few of our production systems – each coloured area is a separate AWS VPC in an account (one is the data centre) and arrows between areas symbolise connectivity.

tmi_workshop_whitelabel

It seemed that many systems were deployed in a VPC that was convenient at the time for that team to develop and operationally manage that system. Would life conceivably be simpler if we considered structuring our VPCs and deployed systems based on business domains?

Taking the provided example we identified how system connectivity would need to occur with two main accounts representing two key business domains:

tmi_workshop_alternative_whitelabel
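The comparison in the two diagrams can be sketched in a few lines: given which account each service lives in and which services call each other, count the distinct account pairs that need some form of connectivity. The services, call graph, and account names below are entirely hypothetical and far smaller than our real systems; the function name is made up for this example.

```python
def cross_account_links(service_account, calls):
    """Return the set of account pairs that need connectivity.

    service_account maps a service name to the account it is deployed in.
    calls is an iterable of (caller, callee) service pairs.
    """
    links = set()
    for caller, callee in calls:
        a, b = service_account[caller], service_account[callee]
        if a != b:
            # Direction does not matter for peering, so store an unordered pair.
            links.add(frozenset((a, b)))
    return links


# Hypothetical call graph, for illustration only.
calls = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]

# One account per team: wherever was convenient at the time.
by_team = {"A": "team-1", "B": "team-2", "C": "team-3", "D": "team-4"}

# Accounts aligned to two business domains.
by_domain = {"A": "domain-x", "B": "domain-x", "C": "domain-y", "D": "domain-y"}

print(len(cross_account_links(by_team, calls)))    # 4
print(len(cross_account_links(by_domain, calls)))  # 1
```

In this toy layout, re-homing the four services into two domain accounts cuts the cross-account links from four to one. Whether our real systems shrink as neatly is exactly what we still need to find out.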

Switching to the second problem, efficiency concerns, we need to explore concerns related to duplicated effort as well as the implications from a risk perspective.

If we spend an hour each quarter auditing each account for compliance, is a collective three weeks of effort from a single systems engineer the best use of their time? Conversely, how easy would automation of these checks be if every account was slightly different?
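The back-of-envelope sum behind that figure looks roughly like this. The 40-hour working week is an assumption, as is reading the three weeks as the cost of one quarterly round; “over 100 accounts” is taken from the count above, and 120 is just an example value.

```python
def audit_weeks(accounts, hours_per_account=1.0, hours_per_week=40.0):
    """Engineer-weeks needed to audit every account once."""
    return accounts * hours_per_account / hours_per_week


print(audit_weeks(100))  # 2.5
print(audit_weeks(120))  # 3.0
```

So anywhere between 100 and 120 accounts lands at roughly two and a half to three weeks per round, and the cost grows linearly with every account we add.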

We are also investing a lot of effort over and over again in definitions of our VPC and other account attributes with a variety of Infrastructure as Code techniques. At the time of writing we have at least 12 separate git repositories for managing AWS account and VPC configuration. These are fairly active in terms of ongoing commits and generally involve between one and 26 contributors (mean 10.25, median 7).

I’m throwing it out there that a lot of this effort has been duplicated and is wasteful.

Where to from here?

We can demonstrate that our account management practices are wasteful and we have theorised that aligning our account boundaries more closely to our business domains will simplify our networking.

To tackle the former we’re creating a new responsibility (led by our infrastructure team and requiring ongoing resourcing) for a team to extract the common ground from existing accounts and VPCs and define a standard. This will be done collaboratively with all existing account owners. This definition will cater for our current use cases. Additionally, tooling will be created to support application of this standard to new and existing accounts. Explanation of the standard and rationale for choices made will be rolled into our internal AWS training program.

This team will then assist all other teams to apply the standard within their accounts. They will also maintain and evolve this standard through our standard consultation and pull request model.

We will measure and expect to see the following benefits:

  • Time spent maintaining AWS infrastructure should decrease;
  • Opportunities to automate compliance checks should emerge;
  • Movement between teams will require less training/induction.
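To give a feel for why a shared standard makes compliance checks easier to automate, here is a minimal sketch that compares each account’s settings against a single standard and reports the differences. The setting names, account names, and the `find_drift` helper are all invented for illustration; they are not items from our security checklist or part of any existing tool.

```python
def find_drift(standard, account_config):
    """Return {setting: (expected, actual)} for every setting that deviates."""
    drift = {}
    for key, expected in standard.items():
        actual = account_config.get(key)  # None when the setting is absent
        if actual != expected:
            drift[key] = (expected, actual)
    return drift


# Hypothetical standard and account settings.
standard = {"cloudtrail_enabled": True, "root_mfa": True, "vpc_flow_logs": True}
accounts = {
    "team-1-prod": {"cloudtrail_enabled": True, "root_mfa": True, "vpc_flow_logs": True},
    "team-2-dev": {"cloudtrail_enabled": True, "root_mfa": False},
}

report = {name: find_drift(standard, cfg) for name, cfg in accounts.items()}
non_compliant = {name: drift for name, drift in report.items() if drift}
print(non_compliant)
```

A check like this only works when every account is described in the same terms, which is the case for defining the standard first.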

To support our theory that aligning our account boundaries more closely to our business domains will simplify our networking we’re going to gather more data including:

  • extending our internal catalogue of all accounts to include their purpose as well as the VPCs contained therein (and their purpose);
  • working through more account consolidation scenarios where services are re-homed to remove the need for VPC peering;
  • publishing some documented integration patterns and recommendations for common scenarios so teams have blueprints to start their discussions with.
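As a starting point for the first item, a catalogue entry might look something like the sketch below. The field names and the `missing_purpose` helper are assumptions made for this example, not the shape of our actual catalogue.

```python
from dataclasses import dataclass, field


@dataclass
class Vpc:
    vpc_id: str
    purpose: str = ""
    peered_with: list = field(default_factory=list)  # ids of peered VPCs


@dataclass
class Account:
    name: str
    owner_team: str
    purpose: str = ""
    vpcs: list = field(default_factory=list)


def missing_purpose(catalogue):
    """List accounts (by name) and VPCs (as 'account/vpc') lacking a purpose."""
    gaps = []
    for account in catalogue:
        if not account.purpose:
            gaps.append(account.name)
        for vpc in account.vpcs:
            if not vpc.purpose:
                gaps.append(f"{account.name}/{vpc.vpc_id}")
    return gaps


# Hypothetical entries.
catalogue = [
    Account("team-1-prod", "team-1", "production APIs", [Vpc("vpc-aaa", "public APIs")]),
    Account("team-2-dev", "team-2", "", [Vpc("vpc-bbb")]),
]
print(missing_purpose(catalogue))  # ['team-2-dev', 'team-2-dev/vpc-bbb']
```

Recording `peered_with` alongside the purpose would also give us the raw data for the consolidation scenarios in the second item.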