Data warehouses, data lakes and data lakehouses: Data storage 101
While working in the field of Data Governance, you are likely to encounter tools and terminology whose names do not explain themselves, which makes them hard to picture. In this blog series we address these topics and explain them in an easily accessible way. This week we focus on data warehouses, data lakes, and data lakehouses.
Introduction
Organisations use processes to create and deliver products and services to their customers. In these processes data is created and changed. Examples are a hospital that has to find a patient’s records, an insurer that has to file an application, and financial transactions that have to be processed. This data is used regularly and must be available at the snap of your fingers. Being able to quickly find, retrieve and store relevant data is essential. Computer systems that support these business processes need to be optimised to provide the necessary performance.
Transactional systems are optimised to find, retrieve and store individual data records.
On the other hand, organisations need information to choose the right strategies, optimise business processes and forecast consumer behaviour. To provide these insights, dashboards and reports are created to support management. This requires quite different systems: systems that can process large amounts of data in a short time and crunch large numbers. Storing this vast amount of data should also be cheaper than storing it in the transactional systems.
Data storage systems are optimised to process large amounts of data at the same time.
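To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `payments` table and the two helper functions are hypothetical and only serve as illustration: the first query fetches one individual record, as a transactional system would, while the second summarises all records at once, as a reporting system would. A real organisation would run these two workloads on different systems.

```python
import sqlite3

# Hypothetical example data: a small in-memory table of payments.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO payments (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.5), ("alice", 12.5), ("carol", 50.0)],
)


def lookup_payment(payment_id):
    """Transactional style: find one individual record by its key."""
    return conn.execute(
        "SELECT id, customer, amount FROM payments WHERE id = ?",
        (payment_id,),
    ).fetchone()


def total_per_customer():
    """Analytical style: process all records at once to produce a summary."""
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM payments "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()
    return dict(rows)


print(lookup_payment(2))
print(total_per_customer())
```

The lookup touches a single row, whereas the summary has to read the whole table. That difference is why the two kinds of systems are optimised differently.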
Different data storage options are available: data warehouses, data lakes and data lakehouses.
Data warehouses
The name data warehouse is already a great analogy: it is a warehouse where data is stored. A warehouse differs from a store. In a store the focus lies on finding, selling, and delivering individual products. In a warehouse products are stored and moved in bulk, mostly for a longer period of time. The same holds for the data variant. In the transactional systems (the systems that help the organisation to deliver its products and services), individual data records are found, changed, and saved quickly and in a structured way. In a data warehouse this same data is stored for a long period of time. Data warehouses are also used to store snapshots of the data. They are a great basis for reporting and an excellent starting point for detailed analysis. Data warehouses contain structured data, that is, data that is modelled and defined in a structured way. Because of this structured character, a data warehouse has a technical architecture that lends itself well to maintenance.
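The idea of storing snapshots can be sketched in a few lines of Python with sqlite3. Everything here is an assumption made for illustration: the `policy` table stands in for a transactional system, `policy_snapshot` for a warehouse table, and `take_snapshot` for a periodic load job. In practice the two tables would live in separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical transactional table: holds only the current state.
    CREATE TABLE policy (policy_id INTEGER PRIMARY KEY, status TEXT);
    -- Hypothetical warehouse table: keeps one copy per snapshot date.
    CREATE TABLE policy_snapshot (
        snapshot_date TEXT, policy_id INTEGER, status TEXT
    );
    INSERT INTO policy VALUES (1, 'requested'), (2, 'active');
    """
)


def take_snapshot(snapshot_date):
    """Copy the current state of every policy into the warehouse table."""
    conn.execute(
        "INSERT INTO policy_snapshot "
        "SELECT ?, policy_id, status FROM policy",
        (snapshot_date,),
    )


def status_history(policy_id):
    """Return how a policy looked at each snapshot, oldest first."""
    return conn.execute(
        "SELECT snapshot_date, status FROM policy_snapshot "
        "WHERE policy_id = ? ORDER BY snapshot_date",
        (policy_id,),
    ).fetchall()


take_snapshot("2024-01-01")
# The transactional system overwrites the old value...
conn.execute("UPDATE policy SET status = 'active' WHERE policy_id = 1")
take_snapshot("2024-02-01")

# ...but the warehouse still remembers both states.
print(status_history(1))
```

The transactional table only knows the latest status, while the snapshot table keeps the history that reports and analyses need.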
Data lakes
The term data lake is a little less self-explanatory. In a data lake both structured and unstructured data are stored. Structured data was explained above. Examples of unstructured or semi-structured data are text, images, PDFs, Excel files, Word documents, CSVs, and JSON. One of the important advantages of a data lake is that it is very flexible in the type of data it can store as well as the volume of this data. The main users of the data lake are AI and data science projects, which handle large volumes of data (Big Data). One of the disadvantages of a data lake is that one can easily get lost in the tremendous amount and variety of data that is stored in it, resulting in a data swamp! Finding data, understanding the meaning of the data, and trusting the quality of the data in question is a big challenge in data lakes.
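As a rough illustration, a data lake can be pictured as a folder into which files of any type are dropped as they are. The sketch below uses only the Python standard library. The file names, their contents and the two helper functions are made up for this example, and a temporary directory stands in for the lake.

```python
import csv
import json
import tempfile
from collections import Counter
from pathlib import Path

# A temporary directory stands in for the data lake (assumption).
lake = Path(tempfile.mkdtemp(prefix="lake_"))

# Files of different types are stored as they arrive, without modelling.
(lake / "sales.csv").write_text(
    "customer,amount\nalice,20\nbob,35\n", encoding="utf-8"
)
(lake / "sensor.json").write_text(
    json.dumps({"device": "t-1", "temp": 21.5}), encoding="utf-8"
)
(lake / "notes.txt").write_text("Call customer back on Monday.", encoding="utf-8")


def inventory(root):
    """Count the files in the lake per file extension."""
    counts = Counter(p.suffix for p in root.rglob("*") if p.is_file())
    return dict(counts)


def read_sales(root):
    """Structure is applied only when the data is read."""
    with open(root / "sales.csv", newline="", encoding="utf-8") as f:
        return [
            {"customer": row["customer"], "amount": float(row["amount"])}
            for row in csv.DictReader(f)
        ]


print(inventory(lake))
print(read_sales(lake))
```

The inventory shows what is in the lake, but not what the files mean, who owns them, or whether they can be trusted. With thousands of files, that gap is what turns a lake into a swamp.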
Data lakehouses
This is where the data lakehouse steps in. At its base it is a data lake, so it offers the same advantages. To overcome the disadvantages of the data lake, the data lakehouse is extended with functions to find and understand the data that is stored. Additionally, the data lakehouse has a connection point through which all data sources can be accessed in the same universal way, making it easier to interact with the data.
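The two additions described here, a way to find and understand data and a single point of access, can be sketched as a small catalogue on top of a folder of files. This is a toy model under stated assumptions, not how any particular lakehouse product works: `CatalogEntry`, `catalog`, `find` and `read` are hypothetical names, and real platforms do considerably more.

```python
import csv
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CatalogEntry:
    """Hypothetical metadata record describing one dataset in the lake."""
    path: Path
    fmt: str
    description: str


# A temporary directory with two files stands in for the underlying lake.
lake = Path(tempfile.mkdtemp(prefix="lakehouse_"))
(lake / "sales.csv").write_text(
    "customer,amount\nalice,20\nbob,35\n", encoding="utf-8"
)
(lake / "sensor.json").write_text(
    json.dumps([{"device": "t-1", "temp": 21.5}]), encoding="utf-8"
)

# The catalogue records where each dataset lives and what it means.
catalog = {
    "sales": CatalogEntry(lake / "sales.csv", "csv", "Sales amounts per customer"),
    "sensor": CatalogEntry(
        lake / "sensor.json", "json", "Temperature readings per device"
    ),
}


def find(keyword):
    """Find datasets whose description mentions the keyword."""
    kw = keyword.lower()
    return sorted(
        name for name, entry in catalog.items() if kw in entry.description.lower()
    )


def read(name):
    """Single access point: callers ask by name, not by file format."""
    entry = catalog[name]
    if entry.fmt == "csv":
        with open(entry.path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    if entry.fmt == "json":
        return json.loads(entry.path.read_text(encoding="utf-8"))
    raise ValueError(f"unsupported format: {entry.fmt}")


print(find("temperature"))
print(read("sales"))
```

Users search by meaning and read by name. Where the file lives and which format it has become details that the catalogue handles for them.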
Enthusiastic or curious about topics on Data Governance? At Clever Republic we are keen on sharing our thoughts about connecting data to systems, processes, people and policies. Feel free to contact us, we are happy to answer any questions you might have.
Frequently asked questions:
What is a data warehouse?
A data warehouse is a structured storage system that is designed for query and analysis. It typically stores data from multiple sources in a structured format and is optimised for fast querying and reporting.
What is a data lake?
In a data lake both structured and unstructured data are stored. One of the important advantages of a data lake is that it is very flexible in the type of data it can store as well as the volume of this data. One of the disadvantages of a data lake is that one can easily get lost in the tremendous amount and variety of data that is stored in it, resulting in a data swamp! Finding data, understanding the meaning of the data, and trusting the quality of the data in question is a big challenge in data lakes.
What is a data lakehouse?
A data lakehouse combines elements of both data lakes and data warehouses. It aims to address some of the limitations of traditional data warehouses and data lakes. A data lakehouse leverages the flexibility of a data lake to store raw, unstructured data, while also incorporating some structure and organisation, similar to a data warehouse. This approach is intended to provide the best of both worlds, allowing for scalable storage of diverse data types and efficient query performance.