Recently we launched a recommendation engine built using AWS Serverless technology. The journey of implementing this solution turned out to be an interesting one on a number of levels. Now that it has been deployed into production, we thought it would be a good idea to share some of our lessons.
Essentially, the system transforms a very large dataset into smaller ones that are used to create audiences or data segments for hyper-targeted EDMs.
To get from the initial state to the final state, the data is transformed over several stages using 8 Lambdas.
As we coded each part of the system, we found ourselves reusing a few approaches. For example, the starting point of our process is an S3 bucket holding around a million JSON files. These files are uploaded into the bucket daily and, once the upload completes, an SNS message is sent to a Topic, which triggers our first Lambda. The SNS message contains the path to the S3 bucket holding the raw data. Initially the job of this first Lambda was to read the contents of the S3 bucket and then store the contents of the files in a DynamoDB table.
In that early stage, we thought we were implementing this:
We quickly discovered that reading a large number of JSON files would take longer than the 5-minute maximum running time of a single Lambda. We then tried to break down the files into groups that could each be processed within a 5-minute block. The problem with that kind of approach is having to keep track of the processed files, which raised several questions:
- How does one choose which block of files to process next?
- Is it best to block out files by filename or a specific number of files?
- Is it worth keeping a cursor file to track the last processed file?
- How could we scale the file selection algorithm so that the Lambda always runs within 5 minutes when the number of source files increases?
We needed an approach that was independent of the number of source files, and a way to manage the 5-minute Lambda running time limit.
The approach we took was for the Lambda to track how long it had been running and, once it passed a particular threshold (in our case, 4 minutes), to shut itself down and send an SNS message to the same Topic to re-trigger itself. Thanks to the marker parameter of the S3 ListObjects call, the data in the SNS message has enough information for the Lambda to continue from where it previously left off. This meant the same Topic, Lambda and SNS message could be reused.
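To make the re-trigger loop concrete, here is a minimal sketch of the idea rather than our production code. The `Job` fields and the injected functions `listPage`, `handleKey`, `publish` and `now` are hypothetical stand-ins for the S3 ListObjects call, the per-file work, the SNS publish and a clock; only `marker` corresponds to a real ListObjects parameter.

```typescript
// Hypothetical shape of the SNS message body. Only `marker` maps to a real
// S3 ListObjects parameter; the other names are assumptions for this sketch.
interface Job {
  bucket: string;
  prefix: string;
  marker?: string; // last key handled by the previous invocation
}

interface Page {
  keys: string[];
  isTruncated: boolean; // true when more keys remain after this page
}

// Injected dependencies (all hypothetical stand-ins).
interface Deps {
  listPage(job: Job): Promise<Page>;     // stand-in for S3 ListObjects
  handleKey(key: string): Promise<void>; // stand-in for the per-file work
  publish(job: Job): Promise<void>;      // stand-in for SNS publish to the same Topic
  now(): number;                         // clock, in milliseconds
}

const THRESHOLD_MS = 4 * 60 * 1000; // stop at 4 minutes, under the 5-minute limit

// Processes pages of keys until the listing is exhausted or the time budget
// is used up. The clock is only checked between pages, so one page must fit
// comfortably in the time left after the threshold.
async function run(
  job: Job,
  deps: Deps,
  thresholdMs: number = THRESHOLD_MS
): Promise<"done" | "retriggered"> {
  const startedAt = deps.now();
  let current: Job = { ...job };
  for (;;) {
    const page = await deps.listPage(current);
    for (const key of page.keys) {
      await deps.handleKey(key);
    }
    if (page.keys.length > 0) {
      // Remember where we got to, so a later invocation can resume from here.
      current = { ...current, marker: page.keys[page.keys.length - 1] };
    }
    if (!page.isTruncated || page.keys.length === 0) {
      return "done";
    }
    if (deps.now() - startedAt >= thresholdMs) {
      await deps.publish(current); // hand the remainder to a fresh invocation
      return "retriggered";
    }
  }
}
```

Because the marker travels inside the message, the function needs no cursor file or external state: each invocation simply starts listing after the last key the previous one handled.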
Another bottleneck in this first Lambda flow was writing to a DynamoDB table. We realised we needed to decouple the Lambda from writing directly to the table, since trying to read from S3 and write to DynamoDB at the same time was impractical. This turned out to be a good use case for Kinesis: the path of each file was pushed to Kinesis streams, and each shard then triggered concurrent instances of a second Lambda.
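As a rough illustration of the producer side, the sketch below groups file paths into batches shaped like the entries of a Kinesis PutRecords request, which accepts at most 500 records per call. Using the file path as both payload and partition key is an assumption made for this example, and `toBatches` is a hypothetical helper, not a function from our codebase.

```typescript
// One entry of a PutRecords request. Field names follow the Kinesis API.
interface StreamRecord {
  Data: string;         // payload: here, just the S3 key of one file
  PartitionKey: string; // Kinesis hashes this value to choose a shard
}

const MAX_RECORDS_PER_CALL = 500; // PutRecords limit per request

// Splits a list of S3 keys into batches small enough for one PutRecords call.
// Using the key itself as the partition key (an assumption for this sketch)
// spreads records across shards, so several consumer Lambdas can run at once.
function toBatches(
  keys: string[],
  batchSize: number = MAX_RECORDS_PER_CALL
): StreamRecord[][] {
  const batches: StreamRecord[][] = [];
  for (let i = 0; i < keys.length; i += batchSize) {
    const batch = keys
      .slice(i, i + batchSize)
      .map((key) => ({ Data: key, PartitionKey: key }));
    batches.push(batch);
  }
  return batches;
}
```

Each batch would then be sent with a single PutRecords call, leaving the DynamoDB writes to the Lambda on the consuming side of the stream.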
Our initial design morphed into something looking like this:
Trigger on completed processing
Once our raw data was in DynamoDB, we needed a way to kick off the next part of the flow. DynamoDB has a feature called Streams, which can be used to trigger a Lambda on an event such as inserting an item into a table. This was not the behaviour we were after: we needed the next Lambda to trigger once all the data had been uploaded to our DynamoDB table, not after each inserted item.
We decided to use a CloudWatch alarm instead. The alarm would trigger if a Lambda function had not run for a set period, in our case 15 minutes, and then send a message to an SNS topic. The SNS topic was configured to trigger the next Lambda.
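For illustration, an alarm along these lines could be described with parameters in the shape accepted by CloudWatch's PutMetricAlarm call. This is a sketch under assumptions, not our actual configuration: the `idleAlarm` helper, the alarm name and the choice to watch the `Invocations` metric are ours for the example. Treating missing data as breaching is also an assumption, made because a Lambda that is never invoked reports no datapoints at all.

```typescript
// Subset of the PutMetricAlarm parameters used in this sketch.
interface AlarmParams {
  AlarmName: string;
  Namespace: string;
  MetricName: string;
  Dimensions: { Name: string; Value: string }[];
  Statistic: string;
  Period: number; // seconds
  EvaluationPeriods: number;
  Threshold: number;
  ComparisonOperator: string;
  TreatMissingData: string;
  AlarmActions: string[];
}

// Builds an alarm that fires when `functionName` has had no invocations for
// `idleMinutes`, notifying the SNS topic that triggers the next Lambda.
// The function name and topic ARN are supplied by the caller (hypothetical).
function idleAlarm(
  functionName: string,
  topicArn: string,
  idleMinutes: number = 15
): AlarmParams {
  return {
    AlarmName: `${functionName}-idle`,
    Namespace: "AWS/Lambda",
    MetricName: "Invocations",
    Dimensions: [{ Name: "FunctionName", Value: functionName }],
    Statistic: "Sum",
    Period: idleMinutes * 60,
    EvaluationPeriods: 1,
    Threshold: 1,
    ComparisonOperator: "LessThanThreshold", // fewer than 1 invocation
    TreatMissingData: "breaching",           // no datapoints counts as idle
    AlarmActions: [topicArn],
  };
}
```

The alarm acts on the transition into the alarm state, so the next Lambda is triggered once per quiet period rather than once per inserted item.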
This part of the system looked as follows:
We used these testing frameworks:
- Chai – great library for assertions
- Sinon – mocking, spies and stubs
- Cypress – used to do our end to end testing
Our testing approach evolved into this:
- We prefer to write unit tests without mocks, as they are easier to read and result in a better design. We do have tests that employ mocks, stubs and spies, but to reduce the number of those kinds of tests we found that passing dependencies into functions or classes is better than creating them within the function.
- Be aware of false positives (tests that should fail, but pass).
- Favour pure functions and in general try to follow a functional approach as it makes the tests easier to write and the code easier to understand.
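The first and last points can be sketched as follows. The `Item` shape, `toItem`, `saveFile` and the `Store` interface are hypothetical names invented for this example. The idea is that the transformation is a pure function needing no mocks at all, while the one side effect goes through a dependency that is passed in rather than created inside the function.

```typescript
// Hypothetical record shape for this sketch.
interface Item {
  id: string;
  suburb: string;
}

// Pure function: same input, same output, no I/O. It can be tested with
// plain assertions and no mocks, stubs or spies.
function toItem(key: string, json: string): Item {
  const parsed = JSON.parse(json) as { suburb?: string };
  return { id: key, suburb: (parsed.suburb || "unknown").toLowerCase() };
}

// The storage dependency is described by a small interface...
interface Store {
  put(item: Item): Promise<void>;
}

// ...and passed in by the caller instead of being constructed here, so a
// test can supply a simple in-memory implementation.
async function saveFile(store: Store, key: string, json: string): Promise<Item> {
  const item = toItem(key, json);
  await store.put(item);
  return item;
}
```

In a test, `store` can be a few lines of hand-written code that pushes items onto an array, which tends to read more clearly than a stubbed SDK client.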
Many Lambdas later, we have a system that transforms our raw data into usable information. Here's a list of the other Lambda functions:
- Lambda that generates a list of interesting suburb segments and then feeds them into a Kinesis stream
- Lambda that generates metadata based on the suburb segments and stores that metadata in S3. This is important since it's used by our React frontend.
- Lambda that generates a manifest file used by our React frontend to point to the relevant segment metadata
- Once bookings are made through the React frontend, a booking creator Lambda runs several times a day to generate the segment data and store it on an S3 bucket for later consumption.
- Uploader Lambda that uploads the segment data to a third-party platform and updates the status of the processed segment in DynamoDB.
- Notification Lambda that sends success or failure notifications based on the segment status.
The next step was making this information available to our React front end.
At this point we had a number of Lambdas which had produced smaller datasets, and we needed a way to interact securely with this system of Lambdas. We used AWS API Gateway because it's a service that makes it easy to publish, maintain, monitor and secure APIs that could require scaling. It also acts as a front door to our application, giving access to the datasets and functionality of our backend system of Lambdas. It handles things like authorisation and access control, monitoring and version control.
We also needed a simple way to deploy and perform configuration tasks. For this we used ClaudiaJS which enabled us to deploy very quickly to different environments (staging and production), and worked well to proxy requests to GraphQL.
We decided to use GraphQL as it's a great way for us to query our data simply. It enabled us to have a separate data layer where we could define our own schema and write simple, readable queries. Its query language is easy to understand and flexible enough to allow our API to evolve easily.
We forged forward with the philosophy that each of our Lambdas performs a single task only. As expected, the final solution ended up with 9 Lambdas. The advantages of a small codebase for each Lambda are clear:
- simple to write and understand
- easy to test
- easy to scale
- fast to deploy
However, having lots of Lambdas also creates some disadvantages:
- it is harder to see how they all fit together
- many repositories and build pipelines are generated
So far the disadvantages haven’t been problematic. To manage the complexity of how they all fit together, whenever we created a new Lambda we made sure the documentation (i.e. Readme.md) and architecture diagram were kept up to date.
The final system resembled this:
Like testing, this topic could be its own blog post. We made our Lambdas fast and easy to deploy partly because of how the projects are structured and our choice of tooling. Thanks to our Delivery Engineering Team, we made use of their existing Ansible templates, and used Buildkite for our continuous integration pipelines.
A great pattern to follow is to put all this configuration within the Lambda repository so that everything is in one spot. The Lambda is then essentially a Lambda in a box, as everything that's required to build, test, deploy and run the code is in a single package.
Monitoring and metrics
- Name everything well. It’ll really pay dividends when your future self or another developer is troubleshooting an issue.
- Keep each Lambda's readme document up to date. Readmes should be simple and to the point, which will make maintaining them easier.
- Keep your diagrams up to date. We use draw.io. It allows the diagram to be exported as PNG and XML data. We keep those two files with the project, so that it's all together and updates are easier.
- Testing is hard, but persist!
- Create scripts that can be used to trigger your Lambda from the command line. It's a great way to do some integration sanity checking, and the person doing the QA on the Lambda will become your best friend 🙂
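As an example of the last tip, a trigger script for an SNS-driven Lambda can be as small as the sketch below. It wraps command-line arguments in the event shape Lambda receives from SNS and calls the handler directly. The `handler` here is a hypothetical stand-in (a real script would import the Lambda's own handler and pass in `process.argv.slice(2)`), and the `bucket` and `prefix` message fields are assumptions for the example.

```typescript
// Minimal subset of the event a Lambda receives from an SNS subscription.
interface SnsEvent {
  Records: { EventSource: string; Sns: { Message: string } }[];
}

// Wraps any message object the way SNS would deliver it.
function snsEvent(message: object): SnsEvent {
  return {
    Records: [{ EventSource: "aws:sns", Sns: { Message: JSON.stringify(message) } }],
  };
}

type Handler = (event: SnsEvent) => Promise<string>;

// Hypothetical stand-in for the Lambda's real handler, which a real script
// would import from the Lambda's own module.
const handler: Handler = async (event) => {
  const job = JSON.parse(event.Records[0].Sns.Message) as {
    bucket: string;
    prefix: string;
  };
  return `would process s3://${job.bucket}/${job.prefix}`;
};

// Entry point: a real script would call main(process.argv.slice(2)).
async function main(args: string[], invoke: Handler = handler): Promise<string> {
  const [bucket, prefix = ""] = args;
  if (!bucket) {
    throw new Error("usage: trigger <bucket> [prefix]");
  }
  return invoke(snsEvent({ bucket, prefix }));
}
```

Keeping the event builder separate from the entry point means the same helper can be reused in integration tests.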