The Cloudera Data Platform (CDP) Public Cloud provides the foundation upon which full-featured data lakes are created. In a previous article, we introduced CDP. This article is the second in a series of six on how to build end-to-end big data architectures with CDP. More specifically, we are going to:
- Create a credential that permits CDP to manage resources on AWS
- Configure an AWS CloudFormation stack that serves as root of our deployment
- Deploy a CDP Environment including a Data Lake to AWS
The configuration and deployment can be accomplished via the web interfaces of Cloudera and Amazon (generally referred to as the CDP console and the AWS console) or via their respective CLI tools. We cover both approaches. First, we demonstrate how to perform all preparatory steps and the actual deployment via the consoles. Second, we provide the commands to perform the same tasks from a terminal using the CLI tools. Before we begin, a couple of important remarks:
- This deployment is based on Cloudera's AWS quickstart documentation and aims to provide a usable environment as quickly as possible. It is not optimized for production use, and it is not suitable if you want to use existing infrastructure components, such as VPCs and subnet groups, instead of CDP-managed ones.
- If you decide to follow along, be aware that CDP creates resources on your AWS account that incur costs. You will find a list of the resources created during this deployment and a ballpark estimate of the associated costs at the end of the article. Always make sure to delete cloud resources that are no longer in use to avoid unwanted costs.
With that said, let’s begin by configuring our CDP and AWS accounts. As a reminder, you need at least Power User privileges on CDP and Administrator access on AWS to follow along.
Deploy using the CDP and AWS Web Interfaces
This approach is recommended if you are new to CDP and/or AWS. It is slower but gives you a better idea of the various steps involved in the deployment process. If you did not install and configure the CDP CLI and the AWS CLI as described in the first part of the series, this is also your only option. If you want to go faster and use the terminal to manage your deployment, scroll down to the Deploy from the Terminal section. Note that you still have to use the CDP console to create your CDP credential, so we recommend that you follow the steps below until you have copied the Amazon Resource Name (ARN) of your cross-account access role.
Create a CDP Credential
CDP Public Cloud creates and manages AWS resources on your behalf. It is therefore necessary to delegate access to your AWS account via a cross-account access role. Our first step is to create this role in your AWS account and store it in your CDP account as a credential.
- Log in to the Cloudera console and access the Management Console.
- Navigate to Shared Resources > Credentials and click Create Credential on the top right.
- In the Create Credential menu, select AWS, then enter a name and optionally a description for your credential. This name and description are used on the CDP side of your architecture.
- Copy the AWS IAM policy that is available under Create Cross-account Access Policy. Be sure to select the version with Default permissions, not the one with Minimal permissions.
- In a new browser tab, navigate to Identity and Access Management (IAM) > Policies in your AWS console and click Create Policy. Paste the policy document you copied from the CDP console.
- Click Next, optionally add tags, and click Next again.
- Review the policy document, then provide a name and an optional description. AWS displays a warning message that you may ignore. Click Create policy.
- Stay in the AWS IAM console, navigate to Roles, and select Create role.
- Under Trusted Entity Type, select AWS Account. Select Another AWS account below and tick the option Require external ID.
- Return to your CDP console and copy the Service Manager Account ID and the External ID into the corresponding fields on AWS. Once you have pasted both IDs, click Next in the AWS IAM console.
- Under Permissions policies, find the policy you created earlier and tick the checkbox on its left, then click Next.
- Under Name, review, and create, enter a name and optionally a description for your role. Scroll down, optionally add tags, and click Create.
- Find your newly created role in the AWS IAM console and copy its ARN.
- Go back to your CDP console and paste the ARN of your cross-account access role into the corresponding field, then click Create.
Congratulations, you have set up your credential to manage AWS resources via CDP.
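
When you select Another AWS account and tick Require external ID, the role ends up with a trust policy that allows the other account to assume it only when the matching external ID is supplied. The following is a minimal sketch of what such a trust policy document typically looks like. The function name `build_trust_policy` and the dummy values are hypothetical; the real account ID and external ID are the ones displayed in your CDP console.

```python
import json


def build_trust_policy(service_manager_account_id: str, external_id: str) -> dict:
    """Build a cross-account trust policy guarded by an external ID.

    Both arguments stand in for the values shown in the CDP console.
    """
    # AWS account IDs are 12-digit numbers.
    account_ok = (
        service_manager_account_id.isascii()
        and service_manager_account_id.isdigit()
        and len(service_manager_account_id) == 12
    )
    if not account_ok:
        raise ValueError("AWS account IDs are 12-digit numbers")
    if not external_id:
        raise ValueError("an external ID is required")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The trusted (external) account that may assume the role.
                "Principal": {"AWS": f"arn:aws:iam::{service_manager_account_id}:root"},
                "Action": "sts:AssumeRole",
                # The role can only be assumed when this external ID is passed.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    # Dummy values for illustration only; use the IDs from your CDP console.
    print(json.dumps(build_trust_policy("123456789012", "example-external-id"), indent=2))
```

Comparing this structure with the Trust relationships tab of your new role is a quick way to confirm that both IDs were pasted correctly.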
Configure an AWS CloudFormation Stack
Next, we create a CloudFormation stack. This stack contains the basic IAM policies, roles, and instance profiles that are used by our CDP resources, as well as the basic configuration of our data lake.
- Download the CloudFormation stack template provided by Cloudera.
- Access your AWS console and navigate to the CloudFormation service. Important: Make sure you are connected to the AWS region in which you want to create your stack. For the purpose of this tutorial, we stay in the EU Ireland (eu-west-1) region.
- Click Create stack. Select Template is ready and Upload template file, then use the file upload dialog to upload the stack template you downloaded earlier. When done, click Next.
- Configure your stack as follows:
  - Choose a stack name, for example my-cdp-stack
  - Choose an S3 bucket and directory to store backups, for example my-unique-cdp-bucket/backups
  - Choose an S3 bucket and directory to store logs, for example my-unique-cdp-bucket/logs
  - Choose an S3 bucket and directory to store data, for example my-unique-cdp-bucket/data
  - Decide on a prefix to use for all IAM resources generated by this stack, for example cdp
- Remember that your S3 bucket name must be globally unique. Be sure to use the same bucket for all three storage locations (/backups, /logs, and /data).
- Click Next, optionally add tags for your stack but change nothing else, and click Next again.
- Under Review stack, scroll all the way to the bottom and confirm that you acknowledge that AWS CloudFormation might create IAM resources with custom names.
- Click Submit to create your stack, then wait for the stack creation to finish. You see a green CREATE_COMPLETE status in CloudFormation once the process has completed successfully.
And that’s it! You now have a stack on which you may deploy a CDP Public Cloud Environment in AWS.
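
Since all three storage locations must share one bucket, it can help to derive them from a single bucket name and to check the name before you type it into the stack form. The sketch below does this with a hypothetical helper, `storage_locations`. It validates only a subset of the S3 bucket naming rules (length, allowed characters, first and last character, no consecutive dots) and cannot tell you whether the name is already taken by someone else.

```python
import re

# Subset of the S3 bucket naming rules: 3 to 63 characters, lowercase
# letters, digits, dots and hyphens, starting and ending with a letter
# or digit. Global uniqueness cannot be checked locally.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def storage_locations(bucket: str) -> dict:
    """Return the backups, logs and data locations for a single bucket."""
    if not _BUCKET_RE.fullmatch(bucket) or ".." in bucket:
        raise ValueError(f"invalid S3 bucket name: {bucket!r}")
    # Same bucket, one directory per purpose, as used in this tutorial.
    return {name: f"{bucket}/{name}" for name in ("backups", "logs", "data")}


if __name__ == "__main__":
    for purpose, location in storage_locations("my-unique-cdp-bucket").items():
        print(f"{purpose}: {location}")
```

Deriving the three values from one variable avoids the most common mistake at this step, which is a typo that points one location at a different bucket.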
Create an SSH Key Pair
When you create your CDP environment, you are required to provide an SSH key pair. While you have the option to create a new key pair as you register the environment, it is preferable to create it in advance.
- Access your AWS console and navigate to EC2 > Network & Security > Key Pairs.
- Make sure you are in the region in which you want to create your environment and click Create key pair.
- Under Create key pair, provide a name for your key pair. You are going to need this name later when you create your environment.
- Choose RSA as Key pair type and .pem as Private key file format.
- Optionally add some tags and click Create key pair.
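
The browser downloads the private key as a .pem file. SSH clients such as OpenSSH generally refuse private keys that other users can read, so it is common practice to restrict the file to owner read-only access (the equivalent of `chmod 400`). A minimal sketch for Linux or macOS, assuming a hypothetical helper name and file path:

```python
import os
import stat


def restrict_private_key(path: str) -> int:
    """Make the key file readable by its owner only and return the new mode."""
    os.chmod(path, stat.S_IRUSR)  # owner read-only, i.e. mode 0o400
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    # Hypothetical location; point this at the .pem file you downloaded.
    key_path = os.path.expanduser("~/Downloads/my-cdp-key.pem")
    if os.path.exists(key_path):
        print(oct(restrict_private_key(key_path)))
```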
Register a CDP Environment in AWS
With all the setup complete, we are finally ready to launch our CDP environment on AWS. Before we proceed, remember that the resources launched by CDP are not free. If you decide to follow along, you will incur some cost on your AWS account. Whenever you practice with any cloud service, be sure to remove resources when done.
- In the CDP console, navigate to Management Console > Environments and click Register Environment.
- In the Register Environment dialog, provide a name and optionally a description for your environment. Select AWS as Cloud Provider and pick the credential you created earlier, then click Next.
- Provide a name and select a runtime version for your data lake. Always select the latest available runtime version unless you have a specific requirement for an earlier version.
- Choose a region to launch your environment in and optionally configure advanced settings.
- Click Register to start the registration process.
After the registration process has completed successfully, you have a fully operational CDP Environment including a Data Lake deployed to AWS. You can now navigate to your environment to perform data management tasks or provision clusters.
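
Registration takes a while. If you script your deployment, you may want to poll until the environment reports that it is ready instead of watching the console. The sketch below shows a generic polling loop under stated assumptions: `wait_for_status`, its parameters, and the status strings used in the demo are hypothetical, and `fetch_status` stands in for whatever call you use to read the current environment status.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    fetch_status: Callable[[], str],
    target: str,
    failed: Iterable[str] = (),
    interval_s: float = 30.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until it returns target, a failed state, or time runs out."""
    failed_states = set(failed)
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status == target:
            return status
        if status in failed_states:
            raise RuntimeError(f"registration ended in state {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s} seconds")
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated sequence with made-up status strings, not confirmed CDP values.
    simulated = iter(["CREATING", "CREATING", "READY"])
    print(wait_for_status(lambda: next(simulated), target="READY", sleep=lambda _s: None))
```

Passing the status reader and the sleep function as arguments keeps the loop independent of any particular CLI or SDK and makes it easy to try out without touching a real environment.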
