How big is your data?

One of the major problems with the term big data is that “big” is relative. If you’re at a conference or reading articles about big data, you’re never told how much data you actually need to be dealing with before the tools being discussed are worth using. You’re probably getting excited about new tools, buzzwords flying everywhere, thinking “I should learn all these things, my data is huge!”. Here’s a tip: it’s probably not that big.

Tools like Hadoop, and EMR, which makes Hadoop easier, came about because scaling vertically is expensive and scaling horizontally was hard, so they sought a way to make the latter easier. The problem with making scaling easier is that people don’t try to optimise first. That can be fine if you’ve got more money than time; you always have to make a trade-off in those situations. If it takes you a month to optimise something and you’re strapped for time, just throw another ten servers at it, things will be fine, right? Well, maybe for six months.

To crunch some real numbers: here at REA we currently gather somewhere in the vicinity of 50-60 million events per day on our websites. If you put all those events into flat files, as you would need for Hadoop-based crunching, you are talking about 80GB of uncompressed data for an entire month, or around 1TB of uncompressed files for an entire year. I can assure you, that is not “big data”.
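A quick back-of-envelope check of those figures (the per-day midpoint and the 30-day month are assumptions made purely for the arithmetic):

```python
# Sanity-check the numbers above. These inputs are assumed, not measured.
events_per_day = 55_000_000        # midpoint of the 50-60 million range
monthly_bytes = 80 * 10**9         # ~80GB of uncompressed flat files/month

bytes_per_event = monthly_bytes / (events_per_day * 30)
yearly_gb = monthly_bytes * 12 / 10**9

print(f"~{bytes_per_event:.0f} bytes per event")  # ~48 bytes
print(f"~{yearly_gb:.0f}GB per year")             # ~960GB, i.e. roughly 1TB
```

So the “around 1TB a year” figure holds up: twelve 80GB months is 960GB.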

If you design and implement your systems well, you should be able to crunch that data in ways that don’t require complex tools like EMR. Tools like that come with massive restrictions, and managing and debugging them is painful, meaning the “cost” of such “helpful” services becomes much more than the monetary cost of the extremely expensive servers you are throwing at the problem.

The vision here at REA for the past year or two has been “microservices”. You may have heard of them? Why not design our “medium data” crunching in a similar fashion? You get quite a few benefits from this. You can save all your generated artefacts along the way, and potentially re-use them for other jobs. You can choose the right tool for each particular section of data crunching. You aren’t tied to any one tool for large, long-running processes. And you aren’t hindered by the restrictions of tools like EMR, which has security woes and is limited in what it can do for you, so if your requirements change you end up either trying to force the tool to do what it can’t, or starting from scratch.

A simple example of extracting your map reduce functions out might look something like this:
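As a rough sketch of the shape of it (the event fields, the CSV artefact format and the reducer names here are all hypothetical): one map step writes a flat-file artefact, and any number of independent reducers can consume that artefact afterwards, without re-doing the map work.

```python
# Hedged sketch: a map step producing a re-usable flat-file artefact,
# plus two independent reducers over that artefact. In practice the
# artefact would be a file on disk and each reducer its own code base.
import csv
import io

raw_events = [
    {"page": "/buy", "suburb": "Richmond"},
    {"page": "/rent", "suburb": "Richmond"},
    {"page": "/buy", "suburb": "Carlton"},
]

def map_step(events):
    """Project raw events into a flat artefact (CSV here)."""
    out = io.StringIO()
    writer = csv.writer(out)
    for e in events:
        writer.writerow([e["page"], e["suburb"]])
    return out.getvalue()

def reduce_page_counts(artefact):
    """One reducer: events per page."""
    counts = {}
    for page, _ in csv.reader(io.StringIO(artefact)):
        counts[page] = counts.get(page, 0) + 1
    return counts

def reduce_suburb_counts(artefact):
    """Another reducer over the SAME artefact - no repeated map work."""
    counts = {}
    for _, suburb in csv.reader(io.StringIO(artefact)):
        counts[suburb] = counts.get(suburb, 0) + 1
    return counts

artefact = map_step(raw_events)
print(reduce_page_counts(artefact))    # {'/buy': 2, '/rent': 1}
print(reduce_suburb_counts(artefact))  # {'Richmond': 2, 'Carlton': 1}
```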

Those reducing processes can all be written in separate code bases, so if you know a sweet Ruby gem that would help, roll with that. Or if you know a Python helper, sweet. You can also scale each of them up vertically individually, and only pay extra for that one process. The best part: you can re-use the artefacts generated by those map processes, and help out your analysts by providing them in flat-file format for tools like R. Less repeated work for everyone! Oh, and you can run multiple reducers in parallel if they end up using the same data set (thanks to that data re-use ability we just gave ourselves!)
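That parallel-reducers point needs nothing more exotic than a thread pool over one shared artefact (the file name and the two tiny “reducers” below are made up for illustration):

```python
# Sketch: two independent reducers running concurrently over the same
# saved artefact. File name and reducer logic are hypothetical.
from concurrent.futures import ThreadPoolExecutor

def count_lines(path):
    """A trivial stand-in reducer: events in the artefact."""
    with open(path) as f:
        return sum(1 for _ in f)

def count_chars(path):
    """Another stand-in reducer over the same file."""
    with open(path) as f:
        return sum(len(line) for line in f)

# Write a tiny stand-in artefact to run against.
with open("events.csv", "w") as f:
    f.write("buy,Richmond\nrent,Carlton\n")

with ThreadPoolExecutor() as pool:
    lines = pool.submit(count_lines, "events.csv")
    chars = pool.submit(count_chars, "events.csv")
    print(lines.result(), chars.result())  # 2 26
```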

Now I can already hear the mumbling; some of you are concerned about “realtime” big data crunching. My theory is that for realtime insights, you are going to be working with smaller windows of data. For example, if you had an insight algorithm that relied on events happening right now, it’s highly unlikely that same algorithm would care about stuff that happened five months ago. Obviously, the smaller the data set, the faster these things run, so decide early what size time window you need for your model to work, and that will determine how “realtime” you can get with your results. I’ve yet to find a need for realtime crunching on data sets any larger than a few weeks. But I’m sure the above architecture could still accommodate such a situation.
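The windowing idea might look something like this (the timestamps, field names and 14-day window are invented for illustration):

```python
# Sketch: an insight that only looks at a trailing window never needs
# to touch the full archive. All values here are hypothetical.
from datetime import datetime, timedelta

def within_window(events, now, days):
    """Keep only events inside the trailing time window."""
    cutoff = now - timedelta(days=days)
    return [e for e in events if e["at"] >= cutoff]

now = datetime(2014, 6, 30)
events = [
    {"at": datetime(2014, 6, 29), "page": "/buy"},   # yesterday
    {"at": datetime(2014, 1, 10), "page": "/rent"},  # months ago
]

recent = within_window(events, now, days=14)
print(len(recent))  # 1 - the five-month-old event is ignored
```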