Hari Kumar

The traditional relational database world got a rude wake-up call with the introduction of Hadoop-based distributed databases. The new world promised flexibility (the ability to handle structured and unstructured data), scalability (the ability to add volume without compromising performance) and significant cost advantages (a 50 to 80 percent reduction in platform cost).



A lot of organizations jumped on the bandwagon and decided to explore their options. Building a Data Lake on a Hadoop distribution became a critical project for most CDOs and CIOs. Over the years, we have seen broadly three approaches to building out the Data Lake:

Data Lake completely replacing the Data-warehouse

Data Lake working in parallel with the Data-warehouse

Data Lake as an ingestion zone which feeds into other downstream applications including Data-warehouse

In a few exceptional situations, we have also seen the traditional Data-warehouse feeding into the Data Lake.

Most organizations that had a legacy Data-warehouse started exploring one of these options. Apart from the high-tech, digital-native and telecom industries (where data volumes are inordinately high and the cost advantages were significant), most industries settled for running the Data Lake in parallel with the Data-warehouse, to test out use cases that were not previously possible on the relational database.

It is too early to call any of these Data Lake build-outs a failure, but one issue keeps nagging most organizations that have embarked on the Data Lake journey: consumption of, and ROI on, their Data Lake investments. Unlike their relational Data-warehouse programs, most organizations treated their Data Lake programs as data ingestion programs, on the underlying assumption that once the data is ingested, the consumption use cases will come automatically and adoption is a given.

Unfortunately, this consumption has not happened at the scale and to the extent that organizations would have liked, and justifying the value of these investments has at times proven difficult. The low maturity of governance tools in the distributed data world (data quality, metadata) has not helped either. Data Lakes still have huge potential: data volumes are increasing at over 50 percent for most organizations, and 90 percent of the analysis in the next 4-5 years will happen on unstructured data, which falls into the sweet spot of Data Lakes. Hence it is important for organizations to get it right.



Given the learnings from all these implementations, here are a few pitfalls that organizations can watch out for when they embark on their Data Lake journey:

Treat it like a business project: The Data Lake is also a business project and should not be treated purely as a technology project. Thinking through data ingestion and building the data pipelines is only part of the job. Define the use cases and the associated ROI from the Data Lake. While estimating costs for the build, do not ignore the cost associated with use case development and deployment, including training.

Think through the Architecture: What purpose is the Data Lake serving? If it is cost, the architecture approach should be to replace the Data-warehouse. If the purpose is only to handle non-traditional use cases, it should be a parallel approach, or a Data Lake feeding into the Data-warehouse. Using the right tool for the right problem helps: traditional Data-warehouses are still the go-to solution for areas like Risk Management and Finance reporting, where accuracy, lineage and relational integrity are paramount.

Phased approach vs Big Bang: While a big bang approach looks attractive from a project costing and effort standpoint, it is best to break up the buildout into logical chunks to avoid inordinate delays and fatigue among end users and leadership.

Evaluate all platform options: While most organizations decide on an on-premise deployment to sidestep data privacy and security concerns, there are options available on cloud with pre-built configurations and a plethora of micro-services, which provide flexibility, speed-to-market and a shorter learning curve. If regulations in the industry permit, cloud-based options should not be ignored.

Don’t ignore Data Governance: In the 25+ years that Data-warehouses have been in existence, the main issue one keeps hearing about is the quality and consistency of the data. The ingestion-first approach to Data Lake buildout is only going to exacerbate the problem by pushing Data Quality and Data Management issues downstream. Only strong governance teams and processes can help prevent the Data Lake from becoming a Data Swamp.

Like all other technology projects, Data Lake projects have to be thought through in the context of one’s own organization; there is no one-size-fits-all. It is important to understand the organizational needs first, then align the technology and build options for those specific needs. Otherwise, one risks investing in too much technology or embarking on projects that do not deliver business value.