One of the major problems with the term big data is that “big” is relative. If you’re at a conference or reading articles about big data, you’re never told how much data you actually need to be dealing with before the tools being discussed are worth considering. You’re probably getting all excited about using new tools, buzzwords flying everywhere, and thinking “I should learn all these things, my data is huge!”. Here’s a tip: it’s probably not that big.
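If you want to put a number on “big”, a quick check is to compare your dataset’s size on disk against the RAM of a single machine. This is a minimal sketch; the file name and RAM figure are hypothetical placeholders, not from any particular setup:

```python
import os

# Create a toy dataset so the check below is runnable (hypothetical data).
with open("events.csv", "w") as f:
    f.write("user,action\n1,click\n")

RAM_BYTES = 16 * 1024**3  # assume a modest 16 GB workstation

size = os.path.getsize("events.csv")
fits = size < RAM_BYTES
print("fits in memory:", fits)
```

If the answer is yes, a single machine and ordinary tools will usually get you further, faster, than standing up a cluster.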
Tools like Hadoop, and EMR, which makes Hadoop easier, came about because scaling vertically is expensive and scaling horizontally was hard, so they sought a way to make the latter easier. The problem with making scaling easier is that people don’t try to optimise first. That can be fine if you’ve got more money than time; you always have to make a trade-off in those situations. If it takes you a month to optimise something and you’re strapped for time, just throw another 10 servers at it, things will be fine, right? Well, maybe for 6 months.