Personal Lessons Learned. Because I have worked primarily as an end-user and in knowledge management roles within large organizations, learning about the dynamics of modern data management architectures was useful, if only to gain initial exposure to the terms and current trends. I plan to revisit these topics after interviewing some industry experts about the practical use of these architectures.
Introduction. There’s a plethora of information published on data warehouses and data lakes, including two great sources listed below (Amazon and Gartner). Here, we don’t need to restate the definitions; instead, we draw out takeaways and highlight contrasts between the two terms, both of which are critical within the modern data management landscape.
Both data warehouses and data lakes are used to store large amounts of client data. Gartner produced a great article in 2020 identifying similarities, differences, uses, trends, and recommendations for these architectural patterns, cited below. We will highlight some of the key findings here:
Similarities (Data Warehouses & Data Lakes): Both patterns enable data analysis by providing repositories in which large amounts of detailed data can be collected and analyzed. According to Friedman and Heudecker, “both provide an endpoint for collection of transactional, detailed data… specifically to support the execution of analytic workloads.”
Differences (Data Warehouses & Data Lakes): The primary contrast between these two systems lies in method and focus. Data warehouses contain “curated” data, that is, data that has been formatted and standardized to some extent. Data lakes, in contrast, contain data in a generally rawer form, which may differ depending on the source. Data lakes are premised on storing information “as-is”.
I. Data lakes collect and store “unrefined data…with limited transformation and quality assurance…and events captured from a diverse array of source systems” (Gartner, citation below). Enterprises with use cases in “exploratory analysis and data science activities” across multiple types and sources of data (e.g., mobile phones, internal networks, external sources) will likely find the data lake format more useful for their analysis activities. This means that creative solutions and previously undefined links and trends (i.e., non-intuitive insights) may be easier to find when analytics processes run against data lakes.
II. Data warehouses are “database[s] optimized to analyze relational data coming from transactional systems and line of business applications….data is cleaned, enriched, and transformed so it can act as the ‘single source of truth’ that users can trust.” (Amazon, citation below)
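To make the “as-is” versus “curated” contrast concrete, here is a minimal Python sketch. The records, field names, and the `curate` helper are all hypothetical, invented only for illustration; they do not come from Amazon or Gartner. The idea is simply that the lake keeps every record in whatever shape its source produced, while the warehouse only accepts rows that fit one agreed schema.

```python
import json

# Hypothetical raw events, stored "as-is" in the lake:
# each source uses its own field names and formats.
RAW_LAKE = [
    '{"source": "mobile", "user": "A17", "amt": "19.99", "ts": "2022-03-01T10:15:00"}',
    '{"source": "web", "customer_id": "a17", "amount_usd": 5.0, "date": "2022-03-02"}',
    '{"source": "partner", "note": "free-text record with no amount"}',
]

def curate(raw_line):
    """Map one raw record onto a fixed (customer_id, amount_usd, day) schema.

    Returns None when the record cannot be standardized.
    """
    rec = json.loads(raw_line)
    customer = rec.get("customer_id") or rec.get("user")
    amount = rec.get("amount_usd", rec.get("amt"))
    day = (rec.get("date") or rec.get("ts") or "")[:10]
    if customer is None or amount is None or not day:
        return None  # does not fit the curated schema; it stays only in the lake
    return (customer.upper(), float(amount), day)

# The "warehouse" holds only cleaned, standardized rows.
warehouse_rows = [row for row in map(curate, RAW_LAKE) if row is not None]
print(warehouse_rows)
```

Note that the third record is not lost: it remains available in the lake for exploratory work even though it never reaches the curated table.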
Keeping Amazon’s definition of data warehousing in mind, there is a close connection between MDM (discussed in blog #3.1) and data warehousing. By standardizing data and eliminating redundancy in entities, MDM (architecture) and data warehousing (method) can work extremely well together.
Key Strategic Findings:
A. One of the important highlights of these two data architectures is that enterprises are not required to choose one or the other. Because analysis functions are growing increasingly complex and demanding, both architectural systems can be used to provide either (a) disparate or (b) sequential results. Friedman and Heudecker note that “it is equally important to recognize that these architectural patterns can bring more value to the enterprise when used in combination…the data warehouse, data lake and data hub can be combined to work together in an effective architecture.” For example, results from a more diverse, less-structured data lake can be transferred to the data warehouse and analyzed further there.
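The sequential case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `lake_events` records and the `device_activity` table are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import sqlite3
from collections import Counter

# Hypothetical less-structured lake records; fields vary from record to record.
lake_events = [
    {"device": "phone", "action": "view", "region": "EU"},
    {"device": "phone", "action": "purchase"},
    {"device": "laptop", "action": "view", "region": "US"},
    {"device": "phone", "action": "view", "region": "EU"},
]

# Step 1: exploratory pass over the lake (count events per device and action).
counts = Counter((e["device"], e["action"]) for e in lake_events)

# Step 2: transfer the result into a structured warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE device_activity (device TEXT, action TEXT, events INTEGER)"
)
warehouse.executemany(
    "INSERT INTO device_activity VALUES (?, ?, ?)",
    [(device, action, n) for (device, action), n in counts.items()],
)

# Step 3: analyze further with ordinary SQL against the curated table.
phone_total = warehouse.execute(
    "SELECT SUM(events) FROM device_activity WHERE device = 'phone'"
).fetchone()[0]
print(phone_total)
```

The lake side tolerates missing fields such as `region`, while the warehouse side enforces a fixed set of columns, which is the division of labor the Gartner quote describes.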
B. It’s important to note that these repositories store data long-term (in contrast to data hubs, which generally pass data on and do not store it). In many cases, stored data may not be current. According to Amazon, “Because the data warehouse is used for analytical processing, it contains data reflecting a specific point in time rather than the most current values.” Analysts must account for the time-value of data and understand that a data warehouse may not reflect real-time values.
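One simple way an analyst might account for this is to carry an “as of” timestamp with every warehouse extract and check its age before relying on it. The sketch below is a hypothetical illustration; the `snapshot` structure and the `is_stale` helper are assumptions of mine, not part of any vendor’s product.

```python
from datetime import datetime, timedelta

# Hypothetical warehouse extract: values are stamped with the time they were loaded.
snapshot = {"as_of": datetime(2022, 3, 1, 0, 0), "inventory_units": 120}

def is_stale(snapshot, now, max_age):
    """Return True when the snapshot is older than the analyst's tolerance."""
    return now - snapshot["as_of"] > max_age

now = datetime(2022, 3, 3, 12, 0)
stale = is_stale(snapshot, now, timedelta(days=1))
print(stale)
```

The acceptable age depends on the question being asked: a quarterly trend report can tolerate much older data than an inventory decision.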
Sources:
(1) “What Is A Data Lake”, Amazon, 2022. https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
(2) “Data Hubs, Data Lakes and Data Warehouses: How They Are Different and Why They Are Better Together”, Gartner, Refreshed 2 June 2021, Published 13 February 2020. https://www.gartner.com/document/3980938?ref=d-linkShare
(3) “Ten Steps to Build an Agile Information Architecture”