No Free Lunch, Even at the Data Lake

Author: David Tufte Posted In: Data

In recent years, with the arrival of cloud computing resources, data lakes, and big data offerings, capturing and storing a wide variety of data has become easier than it’s ever been. We can move data to a data lake and not have to worry about what kind of data it is (relational, document, NoSQL, unstructured data, Cassandra, Hadoop – the list goes on).

With all this data, many interesting types of analysis become possible. This also raises, once again, the promise of self-service reporting and analytics for business and end users. I say “again” because self-service reporting is a familiar refrain (“Our NEW interface is so easy to use, and we’ve made it so simple to connect to a database, that end users can create their own self-service reports and analyses”). Of course, this comes after the IT department has gone through the involved process of ETL (extract, transform, and load) into a data warehouse (or several data marts).

Data Power Users

The data structure of the data warehouse and data marts took a lot of time and resources to determine, especially if there were multiple sources of data that needed to be included in the business reports. After all that preparation, only about 5% to 10% of the business users became self-service report writers. We call these users the power users. The rest of the business users relied on reports developed either by the IT department or by one of the power users.

A few years went by, a “New and Improved” user interface was developed, and the same power users remained the self-service analysis and report writers. Repeat again, and again. Remember, through all these years and iterations of reporting and analysis, the IT department cleaned, structured, and defined the data so it was ready for the business users before the power users created their own reports (still only the 5% – 10%).

The Change in Data

Now, the need for analysis and reporting continues, but the data to report against has changed. We still have the ERP, CRM, and homegrown business systems, but now we also have web data, both relational and non-relational, from IoT devices, web sites, mobile apps, and social media. Instead of planning, developing a structured schema, and cleansing the data before we load it into the data warehouse, we’re told we can load all this data, structured and unstructured, into a data lake.

We don’t need to define the schema when we load the data; we define the schema when we use the data. This is called “Schema on Read.” Nowadays, the end users are not just business users. They are data scientists, data developers, and business analysts. This expanded group will be able to do the self-service analysis and report writing, using “curated data.”
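
To make “Schema on Read” concrete, here is a minimal Python sketch. Everything in it is a made-up illustration: the raw IoT-style records, the field names (device_id, temp_c), and the Reading structure are assumptions, not taken from any particular data lake product. The point is that the raw records are stored exactly as they arrived, and the expected structure is only imposed when someone reads them.

```python
import json
from dataclasses import dataclass

# Hypothetical raw records, stored in the lake exactly as they arrived.
# Nothing checked their structure at load time.
RAW_EVENTS = [
    '{"device_id": "th-001", "temp_c": "21.5", "ts": "2020-01-15T08:00:00"}',
    '{"device_id": "th-002", "temp_c": 19, "ts": "2020-01-15T08:00:05", "fw": "1.2"}',
    '{"device": "th-003", "temperature": 22.1}',
]


@dataclass
class Reading:
    """The schema this particular report expects (an assumption of the sketch)."""
    device_id: str
    temp_c: float


def read_with_schema(raw_lines):
    """Apply the schema at read time; set aside records that do not fit it."""
    readings, rejected = [], []
    for line in raw_lines:
        record = json.loads(line)
        try:
            readings.append(
                Reading(str(record["device_id"]), float(record["temp_c"]))
            )
        except (KeyError, TypeError, ValueError):
            # The record loaded into the lake without complaint,
            # but it does not match what this report needs.
            rejected.append(record)
    return readings, rejected


readings, rejected = read_with_schema(RAW_EVENTS)
print(len(readings), "usable readings;", len(rejected), "rejected")
```

Loading was effortless, but the third record only reveals itself as a problem when the report runs. The work of deciding what a valid reading looks like did not go away; it moved to the person reading the data.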

Curated Data

These two little words, “curated data,” are mentioned in passing, like a minor detail. But what do they mean? They mean cleaning, structuring, and defining the data, and determining which elements from the different systems are actually the same (customers, products, locations) but carry different ID numbers and formats. The phrase “schema on read” means these issues need to be resolved when the reports are created. The 5% – 10% power users haven’t had to do all the prep work before the reports are worked on, but it still needs to be done. You do the structuring and schema design either before you load the data or before you use the data. Either way, “There is no such thing as a free lunch.”
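
As a rough illustration of what curation involves, the Python sketch below matches customers from two systems that use different ID numbers and name formats. The system names, field names, and records are hypothetical, and exact matching on a normalized name is a deliberately simple stand-in; real customer matching usually needs much more than this.

```python
import re

# Hypothetical extracts from two systems that describe the same customers
# with different ID schemes and name formats.
CRM_CUSTOMERS = [
    {"crm_id": "C-1001", "name": "Acme Corp."},
    {"crm_id": "C-1002", "name": "Globex, Inc."},
]
ERP_CUSTOMERS = [
    {"cust_no": 77, "customer_name": "ACME CORP"},
    {"cust_no": 78, "customer_name": "Initech LLC"},
]


def normalize_name(name):
    """Lowercase, strip punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())


def match_customers(crm_rows, erp_rows):
    """Map CRM IDs to ERP customer numbers where the normalized names agree."""
    erp_by_name = {
        normalize_name(row["customer_name"]): row["cust_no"] for row in erp_rows
    }
    matches = {}
    for row in crm_rows:
        key = normalize_name(row["name"])
        if key in erp_by_name:
            matches[row["crm_id"]] = erp_by_name[key]
    return matches


matches = match_customers(CRM_CUSTOMERS, ERP_CUSTOMERS)
print(matches)
```

Someone still has to decide that “Acme Corp.” and “ACME CORP” are the same customer, and what to do with the records that match nothing. Whether that decision is made before the load or at report time, it is the same work.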