CDP Part 1: An Intro to CDP’s End-to-End Data Lakehouse Architecture

Cloudera Data Platform (CDP) is a hybrid data platform for big data transformation, machine learning, and data analytics. In this series we describe how to build and use an end-to-end big data architecture with Cloudera CDP Public Cloud on Amazon Web Services (AWS). Our architecture is designed to retrieve data from an API, store it in a data lake, move it to a data warehouse, and eventually serve it to analytics end users in a data visualization application. This series consists of six articles.

Architectural considerations

The purpose of our architecture is to support a data pipeline that allows the analysis of variations in the stock prices of multiple companies. We are going to retrieve data, ingest it into a data warehouse, and eventually plot it on charts to gain visual insights. This architecture requires the following capabilities:

  • We need an application that extracts the stock data from a web API and stores it in a cloud provider’s storage solution (a minimal sketch of this step follows the list).
  • We also need the ability to run jobs that transform the data and load it into a data warehouse.
  • The data warehouse solution must be able to store the incoming data and support querying with SQL syntax.
  • Also, we want to make sure we can use the modern Apache Iceberg table format.
  • Finally, we want to use the analytics service built into the Cloudera platform.
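
To make the first requirement concrete, here is a minimal Python sketch of the extraction step. The API endpoint, API key, and bucket name are placeholders, not part of CDP; in the actual architecture this step is handled by the DataFlow service described below.

    import json
    from datetime import datetime, timezone

    import boto3
    import requests

    # Placeholder endpoint and credentials: substitute a real stock quote API.
    API_URL = "https://api.example.com/v1/quotes"
    API_KEY = "my-api-key"
    BUCKET = "my-data-lake-bucket"

    def extract_quotes(symbols):
        """Fetch quotes for the given symbols and stage them as JSON in S3."""
        response = requests.get(
            API_URL,
            params={"symbols": ",".join(symbols)},
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,
        )
        response.raise_for_status()

        # Stage the raw payload under a timestamped key for later processing.
        key = f"raw/quotes/{datetime.now(timezone.utc):%Y/%m/%d/%H%M%S}.json"
        boto3.client("s3").put_object(
            Bucket=BUCKET,
            Key=key,
            Body=json.dumps(response.json()).encode("utf-8"),
        )
        return key

    if __name__ == "__main__":
        print(extract_quotes(["AAPL", "MSFT", "GOOG"]))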

With this in mind, let’s take a closer look at what CDP offers.

CDP Architecture

Every CDP account is associated with a control plane, a shared infrastructure that facilitates the deployment and operation of CDP Public Cloud services. Cloudera offers control planes in three regions: us-west-1 hosted in the USA, eu-1 located in Germany, and ap-1 based in Australia. At the time of writing, us-west-1 is the only region in which all data services are available. The official CDP Public Cloud documentation lists the available services per region.

CDP itself does not host data or perform computations. For a public cloud deployment, CDP uses the infrastructure of an external cloud provider – AWS, Azure, or Google Cloud – to perform computations and store data for its managed services. CDP also allows users to create private cloud deployments, either on on-premises hardware or on cloud infrastructure. For private cloud deployments, Cloudera provides the Cloudera Manager application, hosted on your infrastructure, to configure and monitor the core clusters. In this and subsequent articles, we focus exclusively on a public cloud deployment on AWS.

CDP Public Cloud allows users to create multiple environments hosted on different cloud providers. An environment groups the virtual machines and virtual networks on which managed CDP services are deployed. It also holds user configurations such as user identities and permissions. Environments are independent of each other: a CDP user can run multiple environments on the same cloud provider or on different cloud providers. Note, however, that some CDP services are not available on all cloud providers. For example, at the time of writing, only environments hosted on AWS allow the CDP Data Engineering service to use Apache Iceberg tables. The diagram below describes the relationship between CDP and the external cloud provider:

CDP Services

The image below shows the landing page of the CDP Console, the web interface of the platform, in the us-west-1 region:

The left-to-right order of the services displayed in the console is logical: it follows the flow of a data pipeline. The DataFlow service extracts data from various sources, while the Data Engineering service handles data transformations. The Data Warehouse and Operational Database services store ready-to-use data, and finally, the Machine Learning service allows data scientists to perform artificial intelligence (AI) tasks on the data. Let’s describe the services in more detail, with a focus on the ones we use in our end-to-end architecture.

DataFlow

This service is a streaming application that allows users to pull data from various sources and place it in various destinations for staging, such as an AWS S3 bucket, with support for triggers. The underlying component of this service is Apache NiFi. All data flows created by users are stored in a catalog, from which users may choose a flow and deploy it to an environment. Some ready-made flows for specific purposes are stored in the ReadyFlow gallery, shown below. DataFlow is activated either as a “deployment”, which creates a dedicated cluster on your cloud provider, or in a “functions” mode that makes use of serverless technologies (AWS Lambda, Azure Functions, or Google Cloud Functions).

Data Engineering

This service is the core extract, transform, load (ETL) component of CDP Public Cloud. It performs the automatic orchestration of data pipelines, ingesting and processing data to make it usable downstream. It picks up data placed in a staging area by the DataFlow service and runs Spark or Airflow jobs on it. To use this service, users need to enable it and create a virtual cluster where these orchestration jobs can run. The service also provisions virtual machines and database clusters on your external cloud provider.
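
As an illustration, here is a minimal sketch of the kind of Spark job the Data Engineering service could run, assuming the raw JSON quotes staged earlier and an Iceberg-enabled Spark catalog. The bucket path, catalog, and table names are placeholders; this is not the exact job built later in the series.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Assumes a cluster configured with an Iceberg catalog named
    # "spark_catalog"; paths and table names are placeholders.
    spark = SparkSession.builder.appName("transform-stock-quotes").getOrCreate()

    # Read the raw JSON files staged by the ingestion flow.
    raw = spark.read.json("s3a://my-data-lake-bucket/raw/quotes/")

    # Example transformation: daily average price per company symbol.
    daily_avg = (
        raw.withColumn("trade_date", F.to_date("timestamp"))
        .groupBy("symbol", "trade_date")
        .agg(F.avg("price").alias("avg_price"))
    )

    # Write the result as an Iceberg table for the Data Warehouse to query.
    daily_avg.writeTo("spark_catalog.stocks.daily_avg_price").using(
        "iceberg"
    ).createOrReplace()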

Data Warehouse

This service allows users to create databases and tables and to query the data using SQL. A warehouse holds data ready for analysis, and the service includes a Data Visualization feature. Users need to enable the Data Warehouse service for their environment and create a so-called “virtual data warehouse” to handle analytical workloads. These actions create Kubernetes clusters and filesystem storage (Amazon EFS in the case of AWS) on the external cloud provider.
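
To give an idea of how such a warehouse is queried programmatically, here is a hedged Python sketch using the impyla client, assuming an Impala-based virtual warehouse. The hostname, credentials, http_path, and table are placeholders; the exact connection settings depend on how your virtual warehouse is configured, so copy the real endpoint from its details page.

    from impala.dbapi import connect

    # Endpoint and credentials are placeholders: copy the real endpoint
    # from the virtual data warehouse's details page in the CDP console.
    conn = connect(
        host="my-virtual-warehouse.example.cloudera.site",
        port=443,
        use_ssl=True,
        use_http_transport=True,
        http_path="cliservice",
        auth_mechanism="LDAP",
        user="my-user",
        password="my-password",
    )

    cursor = conn.cursor()
    cursor.execute(
        "SELECT symbol, trade_date, avg_price "
        "FROM stocks.daily_avg_price ORDER BY trade_date DESC LIMIT 10"
    )
    for row in cursor.fetchall():
        print(row)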

Operational Database

This service creates databases for dynamic data operations and is optimized for online transactional processing (OLTP). This distinguishes it from the Data Warehouse service, which is optimized for online analytical processing (OLAP). Since we don’t need OLTP capabilities, we’re not going to use the Operational Database service, and so we won’t discuss it further.

Machine Learning

CDP Machine Learning is the tool used by data scientists to perform estimations, classifications and other AI-related tasks. We have no need for machine learning in our architecture and therefore we’re not going into more detail on this service. For any additional information refer to the Cloudera website.

Our Architecture

Now that we’ve had a look at the services offered by CDP, the following architecture emerges:

Our CDP Public Cloud environment is hosted on AWS, as this is currently the only option that supports Iceberg tables. Data is ingested using CDP DataFlow and stored in a data lake built on Amazon S3. Data processing is handled by Spark jobs that run via the Data Engineering service. Processed data is loaded into a Data Warehouse and ultimately served via the built-in Data Visualization feature. In the next two articles, we configure the environment. Then, you will learn how to manage users and their permissions. Finally, we create the data pipeline.

Follow along: Prerequisites

If you want to follow along as we progress in our series and deploy our end-to-end architecture yourself, certain requirements need to be met.

AWS resource needs and quotas

As described in the previous sections, each CDP service provisions resources from your external cloud provider. For example, running all the required services deploys a small fleet of EC2 instances with many virtual CPUs across them. Consequently, you need to pay attention to the service quota Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances, which governs how many virtual CPUs you may provision simultaneously.

To verify if your quota is high enough and to increase it if necessary, do the following in your AWS console:

  1. Navigate to the region where you want to create the resources
  2. Click on your user name
  3. Click on Service Quotas

Now let’s look at the quotas for EC2:

  1. Click on Amazon Elastic Compute Cloud (Amazon EC2)
  2. Search for the relevant quota limiting your vCPU usage:
  3. Type Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances in the search field
  4. Check that the number of virtual CPUs is over 300 to be safe

If the quota is too restrictive, request an increase. This request can take more than 24 hours to be granted. Click on the name of the quota (action 3 in the…
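
If you prefer to check (and request) the quota programmatically, here is a minimal boto3 sketch using the Service Quotas API. To our knowledge, L-1216C47A is the quota code for Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances, but verify the code and region in your own console before relying on it.

    import boto3

    # Quota code for "Running On-Demand Standard (A, C, D, H, I, M, R, T, Z)
    # instances"; verify it in the Service Quotas console for your account.
    QUOTA_CODE = "L-1216C47A"

    # Use the region where you plan to create the CDP resources.
    client = boto3.client("service-quotas", region_name="us-east-1")

    quota = client.get_service_quota(ServiceCode="ec2", QuotaCode=QUOTA_CODE)
    value = quota["Quota"]["Value"]
    print(f"Current vCPU quota: {value:.0f}")

    # Request an increase if the quota is below the 300 vCPUs recommended above.
    if value < 300:
        client.request_service_quota_increase(
            ServiceCode="ec2", QuotaCode=QUOTA_CODE, DesiredValue=300.0
        )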
