Data Lake

A central storage repository that holds all types of data in their raw, original form.

In Simple Terms

A data lake is a system for storing all kinds of data, including text, numbers, images, and videos, in their original, unprocessed form. It continuously ingests things like app usage logs from smartphones or temperature readings from factory sensors, accumulating them second by second. Rather than discarding anything upfront, it keeps everything in one place so you can pull out just what you need for analysis whenever the time comes.

Behind the Name

The name 'Data Lake' combines the words Data and Lake. It evokes the image of raw, unprocessed data flowing in like water filling a vast lake — collected and stored exactly as it arrives. Unlike a warehouse that organizes information before storing it, a data lake acts more like a giant reservoir that accepts everything first.

Take a Closer Look!

A data lake is a centralized platform for storing all types of data regardless of format.
Its defining characteristic is that data is saved in its raw, original state — without any processing or organization beforehand.

A commonly compared concept is the data warehouse, which cleans and structures data to fit a predefined schema before storing it, much like a neatly organized warehouse.
A data lake, by contrast, stores information without imposing a fixed structure upfront and applies structure only when the data is read, which keeps it flexible enough to support many kinds of analysis that were not planned in advance.
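
To make the contrast concrete, here is a minimal sketch of the "store first, structure later" idea in Python. Everything in it is an assumption for illustration: the sample records, the list named raw_lake standing in for real storage, and the helper read_with_schema are hypothetical and do not correspond to any particular data lake product.

```python
import json

# Hypothetical raw events, kept exactly as they arrived.
# Nothing checks or reshapes them at write time.
raw_lake = [
    '{"source": "app", "user": "u1", "event": "open", "ts": 1}',
    '{"source": "sensor", "device": "d7", "temp_c": 21.5, "ts": 2}',
    '{"source": "sensor", "device": "d7", "temp_c": 22.0, "ts": 3}',
]


def read_with_schema(lake, source, fields):
    """Apply structure only at read time: keep one source and the requested fields."""
    rows = []
    for line in lake:
        record = json.loads(line)
        if record.get("source") == source:
            # Missing fields become None instead of causing an error.
            rows.append({field: record.get(field) for field in fields})
    return rows


# The analysis decides which shape it needs, long after the data was stored.
sensor_rows = read_with_schema(raw_lake, "sensor", ["device", "temp_c"])
```

In this sketch the app logs and the sensor readings sit side by side with different fields, and each analysis picks its own view when it reads. A data warehouse would instead require every record to match an agreed table layout before it could be stored.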

It's particularly well-suited for collecting large volumes of real-time data, such as smartphone usage logs or factory sensor readings.
In machine learning and AI, it's widely used as the foundation for handling the massive amounts of raw data needed as training material.

That said, simply accumulating data without any governance can lead to what's sometimes called a 'data swamp' — a state where no one knows what's inside or where to find it.
The key to using a data lake effectively is establishing clear rules for what goes where, recording what each dataset contains and who owns it, and keeping the contents well organized.
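
One way to picture such rules is a predictable folder layout plus a small catalog that records what each dataset is. The sketch below is only an illustration under assumed names: the path pattern, the CatalogEntry fields, and the register and find helpers are hypothetical, and real data lakes typically rely on a dedicated catalog service rather than a Python dictionary.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CatalogEntry:
    path: str
    owner: str
    description: str


# Hypothetical in-memory catalog: storage path -> what lives there.
catalog = {}


def lake_path(source, day):
    """Build a predictable location from the data source and the arrival date."""
    return f"raw/{source}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"


def register(source, day, owner, description):
    """Record a dataset in the catalog at the moment it is stored."""
    path = lake_path(source, day)
    catalog[path] = CatalogEntry(path, owner, description)
    return path


def find(keyword):
    """Look up dataset paths whose description mentions the keyword."""
    keyword = keyword.lower()
    return [
        entry.path
        for entry in catalog.values()
        if keyword in entry.description.lower()
    ]


sensor_path = register(
    "factory_sensors", date(2024, 3, 5), "ops-team", "Temperature readings"
)
matches = find("temperature")
```

Because every dataset is registered with an owner and a description when it arrives, someone looking for temperature data later can find it by searching the catalog instead of opening files one by one.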