Activating Data Services on CDP Public Cloud Environment: CDP Part 3

One of the main advantages of Cloudera Data Platform (CDP) is its mature managed service offering, which can be easily deployed on-premises, in the public cloud, or as part of a hybrid solution. Our end-to-end architecture heavily relies on several CDP services, including DataFlow, Data Engineering, Data Warehouse, and Data Visualization.

DataFlow, powered by Apache NiFi, enables us to transport data from various sources to different destinations. In our architecture, we use DataFlow to ingest data from an API and transport it to our Data Lake hosted on AWS S3.

Data Engineering, built on Apache Spark, provides powerful features for streamlining and operationalizing data pipelines. We leverage Data Engineering to run Spark jobs that transform our data and load the results into our analytical Data Warehouse.

Data Warehouse is a self-service analytics solution that allows business users to access large amounts of data. It supports Apache Iceberg, an open table format that we use for storing ingested and transformed data. Finally, we serve our data through the built-in Data Visualization feature of the Data Warehouse service.

This article, the third in a series of six, documents the activation process of these services in the CDP Public Cloud environment deployed on Amazon Web Services (AWS). It provides a step-by-step guide on how to enable each service and includes information on the resources created in your AWS account and an estimated cost.

Note that this deployment follows Cloudera’s quickstart recommendations for DataFlow, Data Engineering, and Data Warehouse and is intended to quickly set up a functional environment rather than optimize for production use. Additionally, please remember to release your resources when not in use to avoid unnecessary costs.

The article outlines two approaches for enabling the CDP Public Cloud services: via the Cloudera console and through the CDP CLI. The console approach is recommended for those who are new to CDP and/or AWS, while the CLI approach is preferred by experienced users seeking a quicker deployment.

For the console approach, you need to access the Cloudera console and follow specific steps for each service. For example, to enable DataFlow, navigate to the DataFlow section in the console, open the Environments tab, and click “Enable” next to your environment. Before confirming, make sure the Public Endpoint option is enabled and add any desired tags. The process is similar for Data Engineering and Data Warehouse, with the corresponding instructions provided in the article.

For the CLI approach, you must first declare specific variables in your terminal session and then run the corresponding CLI commands to enable each service. The article provides detailed commands and instructions for enabling the DataFlow and Data Engineering services. Due to current limitations, the virtual cluster for Data Engineering must still be created through the console.
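The CLI flow described above can be sketched as follows. This is a minimal illustration, not the article's actual commands: the variable names CDP_ENV_NAME and CDP_ENV_CRN, the "-de" service name suffix, and the node counts are hypothetical, and the flag names are recalled from the CDP CLI and should be checked against its reference before use. The sketch only builds and prints the commands; it does not run them.

```python
import shlex

# Hypothetical names for the variables declared in the terminal session.
REQUIRED_VARS = ("CDP_ENV_NAME", "CDP_ENV_CRN")


def build_enable_commands(env, min_nodes=3, max_nodes=20):
    """Build (but do not run) CDP CLI commands from session variables.

    `env` is any mapping of variable names to values, e.g. os.environ.
    Flag names are assumptions to verify against the CDP CLI reference.
    """
    # Fail early if a required variable was not declared.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("undeclared variables: " + ", ".join(missing))

    dataflow = [
        "cdp", "df", "enable-service",
        "--environment-crn", env["CDP_ENV_CRN"],
        "--min-k8s-node-count", str(min_nodes),
        "--max-k8s-node-count", str(max_nodes),
    ]
    # Sizing options (instance type, instance counts) are left out here.
    data_engineering = [
        "cdp", "de", "enable-service",
        "--name", env["CDP_ENV_NAME"] + "-de",
        "--env", env["CDP_ENV_NAME"],
    ]
    return [dataflow, data_engineering]


# Placeholder values for illustration only.
example = {"CDP_ENV_NAME": "my-env", "CDP_ENV_CRN": "crn:example"}
for command in build_enable_commands(example):
    print(shlex.join(command))
```

Passing os.environ instead of the placeholder mapping would pick up the variables from the current session. Keeping the commands as argument lists makes them easy to review before handing them to a shell or to subprocess.run.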

Once you have successfully enabled all the services, you will have a fully functional Data Warehouse environment with all the necessary features for deploying your end-to-end architecture. The article concludes by mentioning that adding users to the Data Visualization service will be covered in a separate article.

Remember to consult the article for the complete set of instructions and additional details on each step of the process.
