This is the world of Big Data, where the volume of digital data is expected to double every two years for the foreseeable future. By 2020 there will be eight billion people on earth, using 20 billion devices and communicating with 100 billion connected things. As Big Data blossomed, organizations began to store this endless stream of data in data lakes.

Technically speaking, data lakes are marketed as enterprise-wide data management platforms for analyzing disparate sources of data in their native formats. The main concept of a data lake is simple: instead of placing all the data in a purpose-built data store, we move it into a data lake in its original format. This usually avoids the upfront cost of transforming the data at ingestion time. Once the data is in the lake, we process it with enterprise frameworks such as Hadoop. The diagram below shows a sample data lake architecture.
ONE WAY DATA LAKE
Nowadays, collecting data and storing it in a data lake is a piece of cake for any organization. The real question is whether organizations can use all the data stored in the lake to make well-informed decisions. Every day millions of terabytes of data are generated and stored in data lakes, and the larger the lake, the more difficult it becomes to analyze. Data pours in every single day, but little or none of it is taken out for analysis and real-time decision making. This makes the lake a one-way data lake, where data pours in but isn’t turned into anything. Such a lake is at risk of turning into a data dump where there is only a single lineage of data.
Let us consider an example. A multinational company working on Big Data wanted to track the phone calls of 500 million people to catch a single terrorist. The good news is that, with the help of data scientists, the company was able to track all the calls and catch the terrorist. In the process, however, it dumped a massive amount of data, including irrelevant records, into the data lake.
ARE ORGANIZATIONS USING ALL THE DATA IN THE DATA LAKE FOR MAKING DECISIONS?
The main concern is that the useful data is hidden from the analyst, buried behind walls of irrelevant information, and the relationships between data items are lost as a result. Of the enormous volume stored every day, only a small amount is used to make decisions, which turns the data lake into a data dump full of irrelevant information. The main question, then, is whether there are frameworks or processes that store only relevant data and protect the data lake from becoming a data dump.
According to Inmon, the father of data warehousing, one way to turn the data lake into an information gold mine is to apply some transformations to the Big Data before storing it in the data lake. Because the metadata changes over time, our goal will be to figure out the best possible metadata transformation rules.
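To make the idea of transforming data before it reaches the lake more concrete, here is a minimal Python sketch. It is not Inmon's method or any particular framework's API. It only illustrates one possible reading of the idea: drop records that a relevance rule rejects, and attach metadata, including the version of the rule that was applied, to everything that is kept. The names `LakeRecord`, `transform_before_store`, `is_relevant`, and the `flagged` field are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Iterable, Iterator


@dataclass
class LakeRecord:
    """A record ready for the lake: the original payload plus metadata."""
    payload: dict
    metadata: dict = field(default_factory=dict)


def transform_before_store(
    raw_records: Iterable[dict],
    is_relevant: Callable[[dict], bool],
    source: str,
    rule_version: str,
) -> Iterator[LakeRecord]:
    """Keep only relevant records and tag each one with metadata.

    `is_relevant` is a hypothetical, caller-supplied rule. `rule_version`
    is stored with every record so that rules can change over time
    without losing track of which rule admitted which record.
    """
    for raw in raw_records:
        if not is_relevant(raw):
            continue  # irrelevant data never reaches the lake
        yield LakeRecord(
            payload=raw,
            metadata={
                "source": source,
                "rule_version": rule_version,
                "ingested_at": datetime.now(timezone.utc).isoformat(),
            },
        )


# Hypothetical usage: only call records marked as flagged are stored.
calls = [
    {"caller": "A", "flagged": True},
    {"caller": "B", "flagged": False},
    {"caller": "C", "flagged": True},
]
stored = list(
    transform_before_store(
        calls,
        is_relevant=lambda record: record["flagged"],
        source="call_logs",
        rule_version="v1",
    )
)
```

Recording the rule version alongside each record is a design choice made for this sketch: when the transformation rules are revised, older records can still be traced back to the rule that let them in.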