Reputation: 3277

Can an Data Warehouse include a Data lake?

I want to understand data warehouse and data lake more in detail.

It seems to me there is different information to the topic. Inmon defines a data warehouse as

a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process

Now I understand that this is just a form of architecture and does not imply any technology. Which means that the underlying data can be any structure that could also be an S3 object storage. Moreover Waas et al. in On-Demand ELT Architecture for Right-Time BI: Extending the Vision proposed a data warehouse with a ELT process of integrating data.

When it comes to data lakes I found the following definition

scalable storage repository that holds a vast amount of raw data in its native format ("as is") until it is needed plus processing systems (engine) that can ingest data without compromising the data structure

taken from Data lake governance.

Now can a data warehouse be a more strict data lake? There has been an argument that a data warehouse must use ETL but according to Inmon the definiten does not include any restriction on data transformation? If data integration can be ELT and the there the transformation is agile e.g. it can be easily extended. A data warehouse looks very much like a data lake.

are my assumption correct or am looking at this from a skewed angle.

Upvotes: 1

Answers (3)

agilmore

Reputation: 56

In modern data warehouse architecture, the data lake is treated as a landing zone for raw data from source systems. This data can be in various formats, such as structured, semi-structured, and unstructured. It becomes the source for the Staging layer of the data warehouse.

In the context of a data warehouse, the data lake's only concern is storing raw data in any format. Other layers of the data warehouse architecture are used to transform the data into a form suitable for analytics, etc.

There may be other use cases for data lakes, especially in data science, primarily for unstructured data.

Upvotes: 0

sputnik

Reputation: 38

Data Warehouse was created to solve the need to do Analytical Data Processing on Enterprise Level and Structured data, it means

data is from throughout the organization and is usually brought to warehouse using ETL processes from various sources
Data in warehouse is structured and managed in format optimized for intensive analytical transformations. Most Warehouses structure data as Columnar Store and provide a SQL type interface to work with data.

Data Lake on the other hand was created to be one stop zone for all your organizations data. Data is in raw, unprocessed format straight from applications. You can also process data in lake either by moving them to warehouse or directly use them in Distributed Big Data processing systems.

So from this we see data warehouse is not data lake

since it does not have unstructured data
can only be used for compute intensive OLAP applications

Upvotes: 0

jmng

Reputation: 2568

A Data Warehouse and a Data Lake are independent systems that serve different purposes, can/should be complementary, and both are part of a larger data architecture. A data lake, as a concept, can be just another data source for dimensional models on a data warehouse (although the technological implementation of data lakes enables direct querying over the raw data).

You can think of a Data Lake as a "landing zone" where several systems dump data in a "complex/raw format", e.g. MP3 files from customer support calls, gzipped logs from web servers. It's meant to sit there for historical purposes and for further processing into a format that can be easily analyzed/reported over, e.g. text extraction from MP3 files.

A Data Warehouse also aggregates data from different systems, but the data is modeled into a format appropriate for reporting (like a dimensional model), its model reflects the business/domain's processes and transactions, and is usually highly curated.

Imagine the case: if you log visits to your online store using web server logs, you could keep the gzipped logs (the "transaction data") in a data lake and then process the data into a dimensional model (like this) which will be the "copy of transaction data specifically structured for query and analysis", so business users can easily explore it in Excel or some other reporting tool.

Upvotes: 4

Can an Data Warehouse include a Data lake?

Answers (3)

Related Questions