The Promise of a Modern Data Lakehouse Architecture
Imagine having self-service access to all your business data, wherever it lives, and being able to explore it all at once. Imagine being able to answer business-critical questions almost instantly, without waiting for data to be found, shared, and ingested. Imagine being able to independently discover a wealth of new business insights using a combination of structured and unstructured data. As data analysts and data scientists, we all want to be able to do these things. This is the promise of the modern data lakehouse architecture.
In “Exploring Lakehouse Architecture and Use Cases,” published January 11, 2022, Gartner, Inc. analyst Sumit Pal explains that data lakehouses aim to unify the capabilities of data warehouses and data lakes, supporting analytics and data engineering on a single platform. On paper this sounds very good, but how do you actually build it within your organization and deliver on the promise of self-service across all your data?
New innovations bring new challenges
Cloudera has long supported data lakehouse use cases, using open source engines with open data and table formats on the same data, on premises or in any cloud. But new innovations in the cloud have created a data explosion. We ask new, more complex questions of our data to get even better insights. We ingest new data sets in real time, from more diverse sources than ever before. These innovations pose new challenges for our data management solutions, challenges that require architectural changes and the adoption of new table formats that can support massive scale, increase the flexibility of compute engines and data types, and simplify schema evolution.
- Scale: The massive growth of data born in the cloud demands cloud-native file and table formats. These formats must accommodate massive scale while shortening the response window for accessing, analyzing, and using data sets for business insight. Meeting this challenge means adopting new cloud-native table formats that can handle the scope and scale of modern data.
- Flexibility: As organizations mature in advanced analytics techniques, they demand more: more data types, more levels of curation, more insights from more data. With this in mind, it’s clear that a “one size fits all” architecture doesn’t work here. What’s needed is a diverse set of data services suited to each workload and purpose, backed by optimized compute engines and tools.
- Schema evolution: Fast-moving data and real-time ingestion require new ways to maintain data quality, consistency, accuracy, and overall integrity. Data changes in many ways: its shape and format change, and so do its volume, type, and velocity. As each data set changes throughout its lifecycle, the platform must respond without burden or delay while maintaining performance, consistency, and reliability.
Cloud-native table format innovation: Apache Iceberg
A top-level Apache project, Apache Iceberg is a cloud-native table format built to tackle the challenges of the modern data lakehouse. Today, Iceberg enjoys a large and active open source community, sustained investment in innovation, and significant industry adoption. It is a next-generation table format, designed to be open and to scale to petabyte-sized data sets. Cloudera has incorporated Apache Iceberg as a core element of the Cloudera Data Platform (CDP) and is, as a result, a very active contributor to the project.
Apache Iceberg is built to tackle today’s challenges
Iceberg was born out of the need to tackle the challenges of modern analytics, and it is especially well suited to data born in the cloud. Through many innovations, Iceberg addresses explosive data scale, advanced ways of analyzing and reporting on data, and rapid changes to data without loss of integrity.
- Iceberg processes massive amounts of data born in the cloud. With innovations like hidden partitioning and file-level metadata, Iceberg makes querying very large data sets faster and changing data easier and safer.
- Iceberg is designed to support multiple analytics engines. Iceberg is open by design, not just because it’s open source. Its contributors and committers are dedicated to the idea that for Iceberg to be most useful, it should support a wide range of compute engines and services; as a result, Iceberg supports Spark, Dremio, Presto, Impala, Hive, Flink, and more. With more options for ingesting, managing, analyzing, and using your data, you can build sophisticated use cases more easily, choosing the right engine, the right skill set, and the right tools at the right time. You are never stuck with a fixed engine or toolset, and your data is never locked into a single vendor’s solution.
- Iceberg is designed to adapt to data changes quickly and efficiently. Innovations such as schema and partition evolution mean that changes to data structure happen smoothly, and ACID compliance for fast-ingest data lets Iceberg handle fast-moving data without sacrificing the integrity and accuracy of the data lakehouse. (A short SQL sketch of these capabilities follows this list.)
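As a minimal sketch of hidden partitioning, schema evolution, and partition evolution, the statements below follow the Spark SQL DDL documented by the Apache Iceberg project. The table and column names are hypothetical, and the ALTER statements assume the Iceberg Spark SQL extensions are enabled:

```sql
-- Hidden partitioning: the table is physically partitioned by day, but queries
-- simply filter on event_ts and Iceberg prunes partitions automatically.
CREATE TABLE db.events (
  id       BIGINT,
  event_ts TIMESTAMP,
  payload  STRING)
USING iceberg
PARTITIONED BY (days(event_ts));

-- Schema evolution: a metadata-only change; no data files are rewritten.
ALTER TABLE db.events ADD COLUMNS (region STRING);

-- Partition evolution: new data is written at hourly granularity, while data
-- already written under the daily layout remains valid and queryable.
ALTER TABLE db.events REPLACE PARTITION FIELD days(event_ts) WITH hours(event_ts);
```

Because the partition layout lives in table metadata rather than in directory paths, changes like the one above never require rewriting existing files.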
Architectural innovation: Cloudera Data Platform (CDP) and Apache Iceberg
With Cloudera Data Platform (CDP), Iceberg is not “just another table format” that is accessed from a compute engine via external tables or a similar “bolt-on” approach. CDP fully integrates Iceberg as the primary table format in its architecture, making data easier to access, manage, and use.
CDP includes a common metastore and fully integrates it with Iceberg tables. This means your Iceberg-formatted data assets are fully embedded in CDP’s unique Shared Data Experience (SDX), allowing you to take full advantage of a single source for security and metadata management. With SDX, CDP supports the self-service needs of data scientists, data engineers, business analysts, and machine learning professionals with purpose-built, pre-integrated services.
Pre-integrated services that share the same data context are key to developing modern business solutions that lead to transformational change. We’ve seen companies struggle to integrate multiple analytics solutions from multiple vendors: each new dimension, such as capturing data streams, auto-tagging data for security and governance, or performing data science and AI/ML work, meant moving data in and out of proprietary formats and building custom integration points between services. CDP and Apache Iceberg were developed to bring data services together under one roof and one data context.
CDP uses tight compute integration with Apache Hive, Impala, and Spark to ensure optimal read and write performance. And unlike other solutions that are merely compatible with Apache Iceberg tables and can read and analyze them, Cloudera makes Iceberg an integral part of CDP: a fully native table format across the platform, with read and write support, ACID compliance, schema and partition evolution, time travel, and more. With this approach, new data services can be added easily, and data is not morphed or moved unnecessarily just to fit the process.
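Time travel is a good illustration of what native integration means in practice. The sketch below uses the FOR SYSTEM_TIME / FOR SYSTEM_VERSION syntax that Impala and Hive in CDP document for querying Iceberg snapshots; the table name, timestamp, and snapshot ID are hypothetical:

```sql
-- Query the table as it existed at a given point in time.
SELECT count(*) FROM db.events
FOR SYSTEM_TIME AS OF '2022-01-11 10:00:00';

-- Or pin the query to a specific snapshot ID taken from the table's history.
SELECT count(*) FROM db.events
FOR SYSTEM_VERSION AS OF 1234567890123456789;
```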
In-place upgrade of external tables
With petabytes of data already deployed and serving mission-critical workloads in many industries, it would be a shame for that data to get left behind. So Cloudera added a simple ALTER TABLE statement to CDP that smoothly migrates existing Hive tables to Iceberg tables in place. No data is moved; you can take advantage of the Iceberg table format immediately, simply by changing the table metadata.
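As a sketch of what that looks like, Cloudera’s documentation describes the in-place migration in Hive as setting the Iceberg storage handler on the existing table; the table name here is hypothetical:

```sql
-- Migrate an existing Hive external table to Iceberg in place.
-- Only the table metadata changes; the data files stay exactly where they are.
ALTER TABLE legacy_sales
SET TBLPROPERTIES ('storage_handler' = 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler');
```

After the statement runs, the table can immediately be read and written as an Iceberg table by the engines in CDP.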
Start your architectural innovation with Iceberg in CDP today
Whether you’re a data scientist, data engineer, data analyst, or machine learning expert, you can get started with Iceberg-powered data services in CDP today. Watch our ClouderaNow Data Lakehouse video to learn more about the Open Data Lakehouse, or get started in a few easy steps with our blog post, How to Use Apache Iceberg with CDP’s Open Lakehouse.