Data Lake 101

Forbes estimates that we generate more than 2.5 quintillion bytes of data every day, and the pace is only accelerating as Internet of Things (IoT) devices and applications crop up almost everywhere and cloud computing enters mainstream use.

This holds not only for consumers but even more so for enterprises, big or small, in almost every part of the world. Big Data has arrived, and the question now is: are we ready for it?

Enter the data lake.

An AWS article defines a data lake as a “centralized repository that allows you to store all your structured and unstructured data at any scale.” What this means is that you can dump practically anything into a data lake: media files, PDF files, Word documents, database dumps, et cetera, and it will have no problem dealing with them.

Think Google Drive running on steroids. Yup, it is a prime solution for Big Data.

How does a data lake solve our Big Data challenges? In a data lake, you can store your data as-is, regardless of its volume or size, without needing to perform any extensive pre-processing or transformation before ingestion. This flexibility means the start-up time for a data lake is shorter than for a traditional data warehouse, which is great if you want to get your feet wet with the technology. And once in the lake, your data is ready for various types of analytics, from dashboards to visualizations, even machine learning.
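To make "store your data as-is" concrete, here is a minimal sketch in Python. It assumes the lake is just a directory tree on local disk; the `raw/<source>/<date>/` layout and the names `land_raw` and `checksum` are illustrative choices, not part of any particular product. A cloud object store would play the same role in practice.

```python
import hashlib
from datetime import date
from pathlib import Path


def land_raw(lake_root: Path, source: str, filename: str, payload: bytes,
             ingest_date: date) -> Path:
    """Write payload unchanged under <lake_root>/raw/<source>/<YYYY-MM-DD>/.

    The folder layout is an assumption made for this sketch.
    """
    target_dir = lake_root / "raw" / source / ingest_date.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no parsing, no schema, no transformation
    return target


def checksum(path: Path) -> str:
    """SHA-256 of the stored file, handy for verifying it was kept as-is."""
    return hashlib.sha256(path.read_bytes()).hexdigest()
```

Note that nothing here inspects the payload: a CSV export, a PDF, and a video clip are all handled the same way, and any structure is applied later, when the data is read.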

How to Use a Data Lake in Your Organization

To apply a data lake solution to your organization, we at Maroon prescribe the following methodology, which we have used for multiple customers, ranging from country-wide regulatory bodies to health insurance companies.

Baselining

As an essential first step in any data lake project, take an inventory of the entire universe of data sources available within the organization, and assess the quality of the data in each. Are there duplicate data sources? Are we storing unclean data? What is the technology landscape and architecture of the organization? How are things interconnected? Answering these questions is crucial to a confident assessment of where things stand and what we are really working with.
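A baselining pass can start very small. The sketch below assumes each source has already been sampled into a list of dictionaries; `build_inventory`, `fingerprint`, and `missing_rate` are hypothetical helper names, and the two checks shown (exact duplicate sources, share of empty cells) are only a starting point for a real quality assessment.

```python
import hashlib
import json


def fingerprint(rows):
    """Order-independent hash of a source's rows, to spot duplicate sources."""
    canonical = sorted(json.dumps(r, sort_keys=True) for r in rows)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def missing_rate(rows):
    """Share of cells that are None or an empty string."""
    cells = [v for r in rows for v in r.values()]
    if not cells:
        return 0.0
    return sum(v is None or v == "" for v in cells) / len(cells)


def build_inventory(sources):
    """Return one inventory entry per source, flagging exact duplicates."""
    seen = {}  # fingerprint -> name of the first source with that content
    inventory = []
    for name, rows in sources.items():
        fp = fingerprint(rows)
        inventory.append({
            "source": name,
            "rows": len(rows),
            "missing_rate": missing_rate(rows),
            "duplicate_of": seen.get(fp),
        })
        seen.setdefault(fp, name)
    return inventory
```

Exact-match fingerprints only catch verbatim copies. Near-duplicates, such as the same table exported with different column names, still need a human eye or fuzzier matching.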

Ingestion

Once a baseline has been set, the next step for the organization is to methodically extract data from the identified silos or sources and then map the relationships among them. This step is especially crucial for an approach as unstructured as the data lake, because it makes the succeeding steps easier and more efficient.
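One simple way to get a first draft of those relationships is to look for field names that sources have in common. The sketch below assumes you already have a list of field names per source; `map_relationships` is a hypothetical helper, and a shared name is only a candidate link that someone who knows the data should confirm.

```python
from itertools import combinations


def map_relationships(schemas):
    """Return {(source_a, source_b): shared_fields} for sources sharing fields.

    `schemas` maps a source name to its list of field names.
    """
    links = {}
    for (a, fields_a), (b, fields_b) in combinations(sorted(schemas.items()), 2):
        shared = sorted(set(fields_a) & set(fields_b))
        if shared:
            links[(a, b)] = shared
    return links


schemas = {
    "crm": ["customer_id", "email"],
    "tickets": ["ticket_id", "customer_id"],
    "web_logs": ["session_id", "url"],
}
links = map_relationships(schemas)
```

Here `crm` and `tickets` are linked through `customer_id`, while `web_logs` shares nothing with the others and would need a different join path, or none at all.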

Initialization

With all the relevant information loaded into the data lake, the next crucial step is to work the data. The first step in this phase is data cleansing. Because it comes from multiple sources, the data will most likely be very dirty, with many collisions and duplicates. Depending on the overall cleanliness of your data, cleansing may be performed manually or automatically, using scripts and/or analytics tools. Once cleaned up, the remaining data sets are ready for merging and consolidation, which prepares us for the more exciting part of this data lake journey.
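As an illustration of scripted cleansing followed by consolidation, here is a minimal sketch. It assumes records are dictionaries keyed on an `email` field and that earlier sources win when two records disagree; both are assumptions made for this example, and `normalize` and `cleanse_and_merge` are hypothetical names.

```python
def normalize(record):
    """Trim whitespace and lowercase the email so equivalent records collide."""
    cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
    if cleaned.get("email"):
        cleaned["email"] = cleaned["email"].lower()
    return cleaned


def cleanse_and_merge(*sources):
    """Deduplicate on email and merge fields; earlier sources win conflicts."""
    merged = {}
    for rows in sources:
        for raw in rows:
            rec = normalize(raw)
            key = rec.get("email")
            if not key:
                # No usable key: skipped here, though a real pipeline
                # would more likely quarantine the record for review.
                continue
            target = merged.setdefault(key, {})
            for field, value in rec.items():
                if value not in (None, "") and field not in target:
                    target[field] = value
    return list(merged.values())
```

The "first value wins" rule is the simplest possible conflict policy. Depending on the data, you may prefer the most recent value, or the value from the most trusted source.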

Insighting

With a functional data lake now at your organization's fingertips, much of the magic can begin. With a highly scalable data lake setup, real-time searching can easily be done, even on gigabytes or terabytes of information. Interactive dashboards and visualizations can be built using tools such as Tableau or Power BI. With a very good data scientist, deeper analytics such as customer profiling and sentiment analysis may also be done, all using information coming from the data lake.
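To give a flavor of what customer profiling means at its simplest, the sketch below aggregates order records per customer and assigns a coarse segment. The field names (`customer_id`, `amount`), the `profile_customers` function, and the spending threshold are all assumptions for illustration; real profiling would draw on far richer features and usually a statistical model.

```python
from collections import defaultdict


def profile_customers(orders, high_value_threshold=1000.0):
    """Aggregate order count and total spend per customer, then segment."""
    totals = defaultdict(lambda: {"orders": 0, "spend": 0.0})
    for order in orders:
        entry = totals[order["customer_id"]]
        entry["orders"] += 1
        entry["spend"] += order["amount"]
    profiles = {}
    for customer_id, stats in totals.items():
        is_high_value = stats["spend"] >= high_value_threshold
        profiles[customer_id] = {
            **stats,
            "segment": "high_value" if is_high_value else "standard",
        }
    return profiles
```

The same aggregated output is the kind of table a dashboard tool would sit on top of.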

The value of a Data Lake

The ability to harness more data, from more sources, in less time, and to empower users to collaborate and analyze data in different ways, leads to better, faster decision making. Examples where data lakes have added value include:

Improve customer interactions

A data lake can combine customer data from a CRM platform with social media analytics, a marketing platform that includes buying history, and incident tickets, empowering the business to understand its most profitable customer cohort, the causes of customer churn, and the promotions or rewards that will increase loyalty.

Improve R&D innovation choices

A data lake can help your R&D teams test their hypotheses, refine assumptions, and assess results, such as choosing the right materials in your product design to achieve faster performance, doing genomic research that leads to more effective medication, or understanding the willingness of customers to pay for different attributes.

Increase operational efficiencies

The Internet of Things (IoT) introduces more ways to collect data on processes like manufacturing, with real-time data coming from internet-connected devices. A data lake makes it easy to store and run analytics on machine-generated IoT data to discover ways to reduce operational costs and increase quality.