Friday, April 19, 2013

A General Introduction to Big Data

Today, before we get deeper into Data Mining, Analytics and other industry-related concepts, let us understand a few common terms that we have all heard but never stopped to ask what they actually mean. Sounds interesting? I believe it should. Terms like Big Data, Data Mining and Analytics are not new to most of us. Almost everyone who follows tech news will have come across the term 'Big Data'; it has been much talked about recently on Facebook, Twitter and TechCrunch.

Let's begin with 'Big Data'

So what exactly do we understand when we say Big Data?

The obvious first guess is that it deals with data that is really BIG in size. Yes, in a way it does.

In simple layman's terms, data is managed by special software called a DBMS (database management system), which runs on a machine (a desktop or a server). This software has limits on the size of the data it can handle. When the size of the data goes beyond what a traditional database system can handle, we call it Big Data.

More precisely, Big Data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization.

Now, where did this Big Data come from?

Nowadays, almost all organizations want to capture data from every business transaction and store it, so that they can better understand how well the business is doing. This data could be anything, such as logs or bills, and it varies widely from one organization to another. Let's take an example: a grocery shop may have a database that captures all the bills from its daily business, and then, perhaps on a monthly basis, the owner can work out simple statistics such as total revenue, total profit, number of customers and so on.

Larger organizations capture more data from the business they do. They are normally more interested in how the business is doing year on year (YoY) or quarter on quarter (QoQ). They may go even further and forecast how much business they can do in the future and where they need to improve. Analysis doesn't end here: what happened to the business, how it happened and why it happened can all be figured out by digging deeper into the data. Now that should catch your attention! You never realized that you could learn so many things from your own data, right?
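To make the grocery shop example concrete, here is a minimal sketch in Python of the two calculations mentioned above: monthly revenue and year-on-year growth. The bills, the names `BILLS`, `monthly_revenue` and `yoy_growth`, and the date format are all made up for illustration; a real shop would read this from its own database.

```python
from collections import defaultdict

# Hypothetical bills as (date "YYYY-MM-DD", amount); invented for illustration.
BILLS = [
    ("2012-03-05", 120.0),
    ("2012-03-17", 80.0),
    ("2013-03-02", 150.0),
    ("2013-03-21", 90.0),
]


def monthly_revenue(bills):
    """Sum the bill amounts for each (year, month)."""
    totals = defaultdict(float)
    for date, amount in bills:
        year, month, _day = date.split("-")
        totals[(int(year), int(month))] += amount
    return dict(totals)


def yoy_growth(totals, year, month):
    """Percentage change versus the same month one year earlier."""
    current = totals[(year, month)]
    previous = totals[(year - 1, month)]
    return (current - previous) * 100.0 / previous


totals = monthly_revenue(BILLS)
print(totals[(2013, 3)])            # revenue for March 2013
print(yoy_growth(totals, 2013, 3))  # growth versus March 2012, in percent
```

With these sample bills, March 2013 revenue is 240.0 against 200.0 a year earlier, which is a YoY growth of 20 percent. A quarter-on-quarter figure would follow the same pattern, grouping by quarter instead of month.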

Yes, many organizations gradually realized that they could understand their business better, and do even better, if they really knew what happened, how it happened and what will happen in their business. Hence they started capturing more and more data from every possible dimension. This continued, and over time the size of the data grew from a few gigabytes to a few terabytes. When the data grew larger than expected, the industry ran into the limitations of traditional database systems and concluded that more powerful and more sophisticated tools had to be developed to handle the growing data. That was when Big Data was first talked about.

Later, the size of data started increasing exponentially. All of a sudden, many organizations understood the importance of their own data and started capturing all the information they could from their business. This growth now seems to be getting even faster. Maybe we will need a better replacement for the word 'Big' in Big Data, something like 'Gigantic Data'. Sounds funny? Trust me, this is very much possible!

So, what did people do to handle Big Data?

Since it was obvious that a normal machine could not handle such huge data sets, the industry needed tools powerful enough to handle large data sets and to allow all the necessary operations on them, so that data could be processed, cleaned, filtered, merged and analyzed. These tools also need to process large data sets in parallel to reduce processing time. A few big players came forward and released sophisticated tools that could handle really large data. Oracle launched Exadata, Teradata launched a tool bearing its own name, 'Teradata', Google came up with its own model, MapReduce, and so on. Teradata, for instance, is built to scale to databases of many terabytes and beyond.

Lately, the biggest innovations in software for handling Big Data have come from Apache and Google. Apache developed a tool named 'Hadoop', which can cope with rapid growth in data size and allows faster processing, with more sophisticated functionality for handling data. Google came up with MapReduce, a programming model typically used to process large data sets in a distributed environment. Apache also offers Hive, a layer on top of Hadoop that provides data warehousing functionality. Different players have launched many more tools to handle huge data sets; they are not discussed here.
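To give a feel for the MapReduce programming model, here is a minimal word-count sketch in plain Python. It runs on a single machine and is not Hadoop's API or Google's implementation; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are my own, chosen only to mirror the steps of the model. In a real cluster, the map and reduce steps would run in parallel on many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Made-up input, for illustration only.
LINES = ["big data is big", "data about data"]
counts = word_count(LINES)
print(counts)
```

The point of the design is that each call to `map_phase` looks at only one line and each call to `reduce_phase` looks at only one key, so the work can be split across machines without the pieces needing to talk to each other.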

But one thing to notice is that when you have Big Data, your demands are bigger too. You will no longer be satisfied with simple results like aggregations and roll-ups on a large data set. There is always a need to present this complex data in a simpler, more structured format. Hence we need better and more sophisticated tools to visualize data. A few tools that help with reporting and data handling are Pentaho, Jasper Reports, DAS (Datameer Analytics Solution), Tableau and Platfora. Apart from these, there are many more tools and packages that help us visualize data effectively.


So that's all I could do to introduce, in a simple way, topics like Big Data and Hadoop. I know we didn't talk much about Hadoop, but eventually we will. There will be one more article on Big Data, which will cover a few other important topics. For now, this brief introduction will help me introduce the basics of Data Analytics. In the upcoming articles we will give a brief introduction to Data Analysis and Business Intelligence.

Stay tuned :)

