Run Spark applications with Docker using Amazon EMR 6.x

With Amazon EMR 6.0.0, Spark applications can use Docker containers to define their library dependencies, instead of installing dependencies on the individual Amazon EC2 instances in the cluster. To run Spark with Docker, you must first configure the Docker registry and define additional parameters when submitting a Spark application. For more information, see Configure Docker integration.

When the application is submitted, YARN invokes Docker to pull the specified Docker image and run the Spark application inside a Docker container. This allows you to easily define and isolate dependencies. It reduces the time for bootstrapping or preparing instances in the Amazon EMR cluster with the libraries needed for job execution.

When running Spark with Docker, make sure the following prerequisites are met:

  • The Docker package and CLI are only installed on core and task nodes.

  • On Amazon EMR 6.1.0 and later, you can alternatively install Docker on the master node.

  • The spark-submit command should always be run from a master instance on the Amazon EMR cluster.

  • The Docker registries used to resolve Docker images must be defined using the Classification API with the container-executor classification key when launching the cluster.

  • To execute a Spark application in a Docker container, the following configuration options are necessary (see the sketch after this list):

  • When using Amazon ECR to retrieve Docker images, you must configure the cluster to authenticate itself. To do so, you must use the following configuration option:

    • YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG={DOCKER_CLIENT_CONFIG_PATH_ON_HDFS}

  • In EMR 6.1.0 and later, you are not required to set this option when the ECR auto authentication feature is enabled.

  • Any Docker image used with Spark must have Java installed in the Docker image.
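As a rough sketch of the configuration options referenced in this list, the container-runtime settings look like the following; the image URI and HDFS path are placeholder values, and each variable is passed per job through spark-submit --conf settings rather than exported in a shell.

```bash
# Passed per job as:
#   --conf spark.executorEnv.<VARIABLE>=<value>          (executors)
#   --conf spark.yarn.appMasterEnv.<VARIABLE>=<value>    (driver / application master)
YARN_CONTAINER_RUNTIME_TYPE=docker
YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:pyspark-example
# Needed for Amazon ECR on EMR 6.0.0 only; auto authentication covers this on 6.1.0 and later.
YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG=hdfs:///user/hadoop/config.json
```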

For more information about the prerequisites, see Configure Docker integration.

Creating a Docker image

Docker images are created using a Dockerfile, which defines the packages and configuration to include in the image. The following two example Dockerfiles use PySpark and SparkR.

PySpark Dockerfile

Docker images created from this Dockerfile include Python 3 and the NumPy Python package. This Dockerfile uses Amazon Linux 2 and the Amazon Corretto JDK 8.
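A minimal sketch of such a Dockerfile, written here as a shell heredoc; the exact package list in the original example may differ, and the development group install is only there so that pip can fall back to building from source.

```bash
cat > Dockerfile <<'EOF'
FROM amazoncorretto:8

RUN yum -y update \
 && yum -y groupinstall development \
 && yum -y install python3 python3-devel python3-pip \
 && yum clean all

RUN pip3 install --upgrade pip \
 && pip3 install numpy

ENV PYSPARK_DRIVER_PYTHON python3
ENV PYSPARK_PYTHON python3
EOF
```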

SparkR Dockerfile

Docker images created from this Dockerfile include R and the randomForest CRAN package. This Dockerfile includes Amazon Linux 2 and the Amazon Corretto JDK 8.

For more information on Dockerfile syntax, see the Dockerfile reference documentation.

Using Docker images from Amazon ECR

Amazon Elastic Container Registry (Amazon ECR) is a fully-managed Docker container registry, which makes it easy to store, manage, and deploy Docker container images. When using Amazon ECR, the cluster must be configured to trust your instance of ECR, and you must configure authentication in order for the cluster to use Docker images from Amazon ECR. For more information, see Configuring YARN to access Amazon ECR.

To make sure that EMR hosts can access the images stored in Amazon ECR, your cluster must have the permissions from the AmazonEC2ContainerRegistryReadOnly policy associated with the instance profile. For more information, see Policy.

In this example, the cluster must be created with the following additional configuration to ensure that the Amazon ECR registry is trusted. Replace the endpoint with your Amazon ECR endpoint.
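A sketch of that classification, reusing the example endpoint 123456789123.dkr.ecr.us-east-1.amazonaws.com and a hypothetical file name; the local and centos entries are the defaults mentioned later under Configuring Docker registries.

```bash
cat > container-executor-configuration.json <<'EOF'
[
  {
    "Classification": "container-executor",
    "Configurations": [
      {
        "Classification": "docker",
        "Properties": {
          "docker.trusted.registries": "local,centos,123456789123.dkr.ecr.us-east-1.amazonaws.com",
          "docker.privileged-containers.registries": "local,centos,123456789123.dkr.ecr.us-east-1.amazonaws.com"
        }
      }
    ]
  }
]
EOF
```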

Using PySpark with Amazon ECR

The following example builds an image from the PySpark Dockerfile, then tags and uploads it to Amazon ECR. After the image is uploaded, you can run the PySpark job and refer to the Docker image from Amazon ECR.

After you launch the cluster, use SSH to connect to a core node and run the following commands to build the local Docker image from the PySpark Dockerfile example.

First, create a directory and a Dockerfile.

Paste the contents of the PySpark Dockerfile and run the following commands to build a Docker image.
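Roughly, with placeholder directory and tag names:

```bash
mkdir pyspark
cd pyspark
# Write the Dockerfile here (see the PySpark Dockerfile sketch above), then build it.
sudo docker build -t local/pyspark-example .
```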

Create the ECR repository for the examples.

Tag and upload the locally built image to ECR, replacing the example registry endpoint with your ECR endpoint.
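A sketch of those steps, assuming a repository named emr-docker-examples and the example endpoint used throughout this page:

```bash
# Create the repository (one time) and log the Docker client in to ECR.
aws ecr create-repository --repository-name emr-docker-examples
aws ecr get-login --region us-east-1 --no-include-email
# ...run the docker login command printed by the previous call (prefix with sudo if needed), then:

sudo docker tag local/pyspark-example \
  123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:pyspark-example
sudo docker push \
  123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:pyspark-example
```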

Use SSH to connect to the master node and prepare a Python script with a filename such as main.py. Paste the following content into the file and save it.
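For instance, a script along these lines; the file name main.py and the numpy check are assumptions based on the rest of this page.

```bash
cat > main.py <<'EOF'
# Minimal PySpark job that exercises numpy from inside the Docker container.
from pyspark.sql import SparkSession
import numpy as np

spark = SparkSession.builder.appName("docker-numpy").getOrCreate()
sc = spark.sparkContext

print("numpy version: %s" % np.__version__)
print(sc.parallelize(range(1, 10)).map(lambda x: float(np.log(x))).collect())

spark.stop()
EOF
```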

On EMR 6.0.0, to submit the job, reference the name of the Docker image. Define the additional configuration parameters to make sure that the job execution uses Docker as the runtime. When using Amazon ECR, the YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG must reference the file containing the credentials used to authenticate to Amazon ECR.

On EMR 6.1.0 and later, to submit the job, reference the name of the Docker image. When ECR auto authentication is enabled, run the following command.
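A sketch of the submission using the placeholder image and script names from above; the client-config lines apply to EMR 6.0.0 with ECR and can be dropped on 6.1.0 and later when auto authentication is enabled.

```bash
DOCKER_IMAGE_NAME=123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:pyspark-example
DOCKER_CLIENT_CONFIG=hdfs:///user/hadoop/config.json   # EMR 6.0.0 with ECR only

spark-submit --master yarn \
  --deploy-mode cluster \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG=$DOCKER_CLIENT_CONFIG \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG=$DOCKER_CLIENT_CONFIG \
  --num-executors 2 \
  main.py
```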

When the job completes, take note of the YARN application ID, and use the following command to obtain the output of the PySpark job.
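Something along these lines, substituting the real application ID:

```bash
yarn logs --applicationId application_1234567890123_0001 | grep -C2 numpy
```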

Using SparkR with Amazon ECR

The following example builds an image from the SparkR Dockerfile, then tags and uploads it to ECR. Once the image is uploaded, you can run the SparkR job and refer to the Docker image from Amazon ECR.

After you launch the cluster, use SSH to connect to a core node and run the following commands to build the local Docker image from the SparkR Dockerfile example.

First, create a directory and the Dockerfile.

Paste the contents of the SparkR Dockerfile and run the following commands to build a Docker image.

Tag and upload the locally built image to Amazon ECR, replacing the example registry endpoint with your Amazon ECR endpoint.

Use SSH to connect to the master node and prepare an R script with a name such as sparkR.R. Paste the following contents into the file.
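A rough sketch of such a script; the file name sparkR.R is an assumption, and the checks mirror what the output step below looks for.

```bash
cat > sparkR.R <<'EOF'
# Minimal SparkR job that checks the randomForest package inside the container.
library(SparkR)
sparkR.session(appName = "SparkR-Docker")

library(randomForest)
print(packageVersion("randomForest"))
print(rfNews())

sparkR.session.stop()
EOF
```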

On EMR 6.0.0, to submit the job, refer to the name of the Docker image. Define the additional configuration parameters to make sure that the job execution uses Docker as the runtime. When using Amazon ECR, the YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG must refer to the file containing the credentials used to authenticate to ECR.

On EMR 6.1.0 and later, to submit the job, reference the name of the Docker image. When ECR auto authentication is enabled, run the following command.

When the job has completed, note the YARN application ID, and use the following command to obtain the output of the SparkR job. This example includes checks that the randomForest library is available, along with its installed version and release notes.

Using a Docker image from Docker Hub

To use Docker Hub, you must deploy your cluster to a public subnet and configure it to use Docker Hub as a trusted registry. For more information, see Configure Docker integration.

In this example, the cluster needs the following additional configuration to make sure that the your-public-repo repository on Docker Hub is trusted. When using Docker Hub, replace this repository name with your actual repository.


Source: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-docker.html

Configure Docker

Amazon EMR 6.x supports Hadoop 3, which allows the YARN NodeManager to launch containers either directly on the Amazon EMR cluster or inside a Docker container. Docker containers provide custom execution environments in which application code runs. The custom execution environment is isolated from the execution environment of the YARN NodeManager and other applications.

Docker containers can include special libraries used by the application and they can provide different versions of native tools and libraries, such as R and Python. You can use familiar Docker tooling to define libraries and runtime dependencies for your applications.

Amazon EMR 6.x clusters are configured by default to allow YARN applications, such as Spark, to run using Docker containers. To customize your container configuration, edit the Docker support options defined in the yarn-site.xml and container-executor.cfg files available in the /etc/hadoop/conf directory. For details about each configuration option and how it is used, see Launching applications using Docker containers.

You can choose to use Docker when you submit a job. Use the YARN_CONTAINER_RUNTIME_TYPE and YARN_CONTAINER_RUNTIME_DOCKER_IMAGE variables to specify the Docker runtime and Docker image.

When you use Docker containers to run your YARN applications, YARN downloads the Docker image that you specify when you submit your job. For YARN to resolve this Docker image, it must be configured with a Docker registry. The configuration options for a Docker registry depend on whether you deploy the cluster using a public or private subnet.

Docker registries

A Docker registry is a storage and distribution system for Docker images. For Amazon EMR 6.x, the following Docker registries can be configured:

  • Docker Hub – A public Docker registry containing over 100,000 popular Docker images.

  • Amazon ECR – A fully managed Docker container registry that allows you to create your own custom images and host them in a highly available and scalable architecture.

Deployment considerations

Docker registries require network access from each host in the cluster. This is because each host downloads images from the Docker registry when your YARN application is running on the cluster. These network connectivity requirements may limit your choice of Docker registry, depending on whether you deploy your Amazon EMR cluster into a public or private subnet.

Public subnet

When EMR clusters are deployed in a public subnet, the nodes running YARN NodeManager can directly access any registry available over the internet, including Docker Hub.

Private subnet

When EMR clusters are deployed in a private subnet, the nodes running YARN NodeManager don't have direct access to the internet. Docker images can be hosted in Amazon ECR and accessed through AWS PrivateLink.

For more information about how to use AWS PrivateLink to allow access to Amazon ECR in a private subnet scenario, see Setting up AWS PrivateLink for Amazon ECS, and Amazon ECR.

Configuring Docker registries

To use Docker registries with Amazon EMR, you must configure Docker to trust the specific registry that you want to use to resolve Docker images. The default trust registries are local (private) and centos (on public Docker Hub). To use other public repositories or Amazon ECR, you can override settings in container-executor.cfg using the EMR Classification API with the container-executor classification key.

The following example shows how to configure the cluster to trust both a public repository, named your-public-repo, and an ECR registry endpoint, 123456789123.dkr.ecr.us-east-1.amazonaws.com. If you use ECR, replace this endpoint with your specific ECR endpoint. If you use Docker Hub, replace this repository name with your repository name.

To launch an Amazon EMR 6.0.0 cluster with this configuration using the AWS Command Line Interface (AWS CLI), create a file (for example, container-executor-configuration.json) with the contents of the preceding container-executor JSON configuration. Then, use the following commands to launch the cluster.
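A sketch of that launch, assuming the hypothetical file name above and placeholder values for the key pair, subnet, and instance type:

```bash
aws emr create-cluster \
  --name "emr-6.0.0-docker" \
  --release-label emr-6.0.0 \
  --applications Name=Hadoop Name=Spark \
  --ec2-attributes "KeyName=myKey,SubnetId=subnet-1234567" \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --configurations file://./container-executor-configuration.json
```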

Configuring YARN to access Amazon ECR on EMR 6.0.0 and earlier

If you're new to Amazon ECR, follow the instructions in Getting started with Amazon ECR and verify that you have access to Amazon ECR from each instance in your Amazon EMR cluster.

On EMR 6.0.0 and earlier, to access Amazon ECR using the Docker command, you must first generate credentials. To verify that YARN can access images from Amazon ECR, use the YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG container environment variable to pass a reference to the credentials that you generated.

Run the following command on one of the core nodes to get the login line for your ECR account.

The command generates the correct Docker CLI command to run to create credentials. Copy and run the output from that command.

This command generates a config.json file in the /root/.docker folder. Copy this file to HDFS so that jobs submitted to the cluster can use it to authenticate to Amazon ECR.

Run the commands below to copy the config.json file to the hadoop user's home directory and then put it in HDFS so that it can be used by jobs running on the cluster.
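Roughly, under the usual conventions (the /root/.docker location, the hadoop user's home directory, and the /user/hadoop HDFS path are assumptions):

```bash
aws ecr get-login --region us-east-1 --no-include-email
# ...run the docker login command it prints, with sudo, then:

mkdir -p ~/.docker
sudo cp /root/.docker/config.json ~/.docker/config.json
sudo chmod 644 ~/.docker/config.json

hadoop fs -put ~/.docker/config.json /user/hadoop/
```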

YARN can access ECR as a Docker image registry and pull containers during job execution.

After configuring Docker registries and YARN, you can run YARN applications using Docker containers. For more information, see Run Spark applications with Docker using Amazon EMR 6.0.0.

In EMR 6.1.0 and later, you don't have to manually set up authentication to Amazon ECR. If an Amazon ECR registry is detected in the container-executor classification key, the Amazon ECR auto authentication feature activates, and YARN handles the authentication process when you submit a Spark job with an ECR image. Automatic authentication is enabled and the YARN authentication setting is set to true if docker.trusted.registries contains an ECR registry URL; you can confirm this by checking the setting in yarn-site.

Prerequisites for using automatic authentication to Amazon ECR

  • EMR version 6.1.0 or later

  • The ECR registry included in the container-executor configuration is in the same Region as the cluster

  • An IAM role with permissions to get an authorization token and pull images

Refer to Setting up with Amazon ECR for more information.

How to enable automatic authentication

Follow Configuring Docker registries to set an Amazon ECR registry as a trusted registry, and make sure the Amazon ECR repository and the cluster are in the same Region.

To enable this feature even when the ECR registry is not set as a trusted registry, use the configuration classification to set the ECR auto authentication property to true.

How to disable automatic authentication

By default, automatic authentication is disabled if no Amazon ECR registry is detected in the trusted registry.

To disable automatic authentication, even when the Amazon ECR registry is set as a trusted registry, use the configuration classification to set the ECR auto authentication property to false.

How to check if automatic authentication is enabled on a cluster

On the master node, use a text editor to view the contents of /etc/hadoop/conf/yarn-site.xml and check the value of the ECR auto authentication setting.

Source: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-docker.html

AWS EMR is a service for launching workloads onto large clusters for processing. It supports Spark and PySpark workloads. Launching PySpark scripts that are self-contained and generally don’t need a lot of dependent libraries is… well, it’s not easy, but it’s well-documented. Complex jobs within large pipelines, though, often require specific dependencies and specific settings that need to be managed and if we can simplify that management through containers and match that up to how we test locally, that’s a big plus for us.

Let’s expand the Docker container we built in the previous post for our local use and get code running using that container.

We’re going to assume that you have the permissions to launch EMR clusters from the console and are able to give that cluster permissions to read from Amazon Elastic Container Registry (ECR). Setting up the correct permissions and network infrastructure for a cluster is a whole separate topic that warrants its own post. We’re also going to assume you know how to use ECR itself.

Building the Docker image

We’ll be using EMR 6.1.0 here: 6.0 is a bit more difficult to set up to use Docker images, so 6.1.0 is preferred. The Dockerfile we wrote in that previous post was for Spark 2.4.7, but EMR 6.1 uses Spark 3.0.0 and Hadoop 3.2.1 - which also has the advantage of running on JDK 11 instead of JDK 8.

Go back to the other post to see all the details here. Follow the directions in the ECR console. Note that downloading the Spark and Hadoop packages here is what makes our testing work. Later we’ll see how to trim down our containers.

Updating Our Script

Now, let’s make this interesting and add a few dependencies. In the same directory as the Dockerfile (and its companion files) above, add the new dependency, so we have a dependency that isn’t always available to the code by default. Next, build the container and push it to ECR following the instructions on the AWS ECR console for your repository or the documentation in the user guide. For the rest of this document, we’ll be using the image in the ECR repository we created.

Now let’s create a short script which uses the new dependencies we’ve added.

Give this script a name and copy it to an S3 bucket in your account.

Launching the cluster

There’s a natural instinct to think that putting a library into your Docker container will make it available for execution. But, remember that PySpark scripts are run on the (unfortunately named) master node and they issue commands to the distributed workers (e.g., a write to Parquet is a command that is sent to all the workers, each of which writes the data it has access to into Parquet files at the given location). In fact, at the end of this post, we’ll talk a little more about what the EMR environment is actually doing to run these containers.

Let’s launch our cluster! For production loads, you would want to do more network configuration to put these machines into private subnets and set up IAM carefully. This will prevent the jobs from being able to access resources in your AWS account that it shouldn’t. But here, we’re going to rely on the defaults.

We’re going to need to install Docker on our EMR machines. We’re going to do that through bootstrap actions, commands that are run before the cluster starts running any steps.

  1. Create a file (for example, install-docker.sh), copy the bootstrap script shown after this list into it, and upload it to an S3 bucket in your account.

  2. Log onto the AWS Console and in the “Services” dropdown in the top-left corner, select EMR. Click the button at the top that says “Create Cluster”. All the instructions are as of the writing of the post. Select “Go to advanced options” at the top of the screen, since we’re going to have to add the bootstrap actions.

  3. Make sure to select release emr-6.1.0 - our configuration will rely on specific features of that release.

  4. Click “Next” to go to the Hardware screen - the defaults for the hardware are reasonable enough for our example, so go ahead and click “Next” again. (Note that this relies on the existence of a default VPC and subnet in your account.)

  5. In the Configurations text box, we’re going to add a setting that will allow the cluster to trust the Docker image from ECR.

  6. Copy the following, making sure you use the URI from ECR that corresponds to your repository. Don’t include the name of your repository (the part after the hostname). We’re also assuming here that the EMR service role has access to these registries and can pull images.

  7. Name your cluster on the General Options screen. Open up the Bootstrap Actions section on this screen and add a custom action. Click the “Configure and add” button. Name the action and, in the Script location field, add the S3 URL of your bootstrap script.

  8. Click “Next”, keep the Security defaults, and then click “Create cluster”. And now we wait….
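A minimal sketch of the bootstrap script from step 1; installing Docker through amazon-linux-extras and adding the hadoop user to the docker group are assumptions about how the original script worked.

```bash
#!/bin/bash
# Hypothetical install-docker.sh bootstrap action: install and start Docker on every node.
set -euo pipefail

sudo amazon-linux-extras install -y docker
sudo systemctl enable docker
sudo systemctl start docker

# Let the hadoop user (which YARN runs as) talk to the Docker daemon.
sudo usermod -a -G docker hadoop
```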

Running the script

Once the cluster is up and running, we can add a step and watch it run. The parameters here are going to be numerous, but we’ll go through each so they make sense.

  1. Select the Cluster on the EMR page of the AWS Console. Select the Steps tab and hit the Add step button.
  2. In the dialog, name your step (e.g., docker-test), select Custom JAR, and type in command-runner.jar for the JAR location.
  3. Then, let’s put in the actual PySpark job definition. In the Arguments text box, copy in the following:

I put most of the PySpark parameters into the command itself rather than in the step above so we could go through them here. In production environments some of these parameters could be set globally there.

All of the spark.executorEnv and spark.yarn.appMasterEnv parameters are setting environment variables which change how the job gets run.

  • YARN_CONTAINER_RUNTIME_TYPE=docker means that this job will use Docker containers for execution.
  • YARN_CONTAINER_RUNTIME_DOCKER_IMAGE is the fully qualified URI to the Docker image in ECR, meaning that both the master environment and the executors will use that image for their context.
  • PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON tell the environment what Python executable in the image to use for running the individual tasks and the main program, respectively.
  • JAVA_HOME points to the JDK location inside the image, because the OpenJDK 11 image uses a different home directory than the EMR default.
  • The Java options setting is overridden because AWS EMR has defaults that assume the Oracle Java implementation rather than the OpenJDK executable, so the job fails unless we set it to something else. Note that this is where you could tune garbage collection settings and other Java performance settings; the corresponding executor setting is overridden for the same reason.

How This Works and Future Improvements

Reading through the EMR documentation on configuring an image only says that for PySpark to work, all you need is a Java container. How does all of this complex infrastructure work if that’s the only requirement?

AWS EMR is eliding a lot of details about how this actually is working. When EMR creates its cluster without using Docker, it’s configuring the machines’ Hadoop and Spark settings to tell each executor and driver where to go to request resources from YARN, mounting the S3 buckets into the Hadoop filesystem, and a number of other settings. When running in Docker, somehow the Docker containers need to get (1) that same configuration and (2) all the expected Java libraries to correctly run.

In fact, what’s happening is that AWS EMR is mounting a directory containing the Hadoop and Spark files (matching the expected version for EMR 6.1.x). In that directory is a dynamically-created set of configuration files which sets the appropriate server names. There’s a script that runs inside the container when jobs get run, which sets a number of environment variables, including the Spark and Hadoop home variables, to the mounted directory.

This means that setting those environment variables - which you need to do to run unit tests - will break the ability to run them in production EMR. In fact, we included Hadoop and Spark in our Docker image so that unit tests pass, but they aren’t needed for execution. You may be able to get a much smaller image for production release, as long as it has Java in it.

Lastly, we definitely don’t want to have to bootstrap install Docker each time we launch a cluster. Creating a simple custom AMI with Docker installed, built off the default Amazon Linux image, would absolutely speed up start-up times.

Source: https://hangar.tech/posts/emr-docker/

Apache Spark is a powerful data processing engine that gives data analyst and engineering teams easy to use APIs and tools to analyze their data, but it can be challenging for teams to manage their Python and R library dependencies. Installing every dependency that a job may need before it runs and dealing with library version conflicts is time-consuming and complicated. Amazon EMR 6.0.0 simplifies this by allowing you to use Docker images from Docker Hub and Amazon ECR to package your dependencies. This allows you to package and manage your dependencies for individual Spark jobs or notebooks, without having to manage a spiderweb of dependencies across your cluster.

This post shows you how to use Docker to manage notebook dependencies with Amazon EMR 6.0.0 and EMR Notebooks. You will launch an EMR 6.0.0 cluster and use notebook-specific Docker images from Amazon ECR with your EMR Notebook.

Creating a Docker image

The first step is to create a Docker image that contains Python 3 and the latest version of the numpy Python package. You create Docker images by using a Dockerfile, which defines the packages and configuration to include in the image. Docker images used with Amazon EMR 6.0.0 must contain a Java Development Kit (JDK); the following Dockerfile uses Amazon Linux 2 and the Amazon Corretto JDK 8:

You will use this Dockerfile to create a Docker image, and then tag and upload it to Amazon ECR. After you upload it, you will launch an EMR 6.0.0 cluster that is configured to use this Docker image as the default image for Spark jobs. Complete the following steps to build, tag, and upload your Docker image:

  1. Create a directory and a new file named Dockerfile using the following commands:
  2. Copy and paste the contents of the Dockerfile, save it, and run the following command to build a Docker image:
  3. Create the Amazon ECR repository for this walkthrough using the following command:
  4. Tag the locally built image and replace 123456789123.dkr.ecr.us-east-1.amazonaws.com with your Amazon ECR endpoint, using the following command:

    Before you can push the Docker image to Amazon ECR, you need to log in.

  5. To get the login line for your Amazon ECR account, use the following command:
  6. Enter and run the output from the command:
  7. Upload the locally built image to Amazon ECR and replace 123456789123.dkr.ecr.us-east-1.amazonaws.com with your Amazon ECR endpoint. See the following command:

After this push is complete, the Docker image is available to use with your EMR cluster.

Launching an EMR 6.0.0 cluster with Docker enabled

To use Docker with Amazon EMR, you must launch your EMR cluster with Docker runtime support enabled and have the right configuration in place to connect to your Amazon ECR account. To allow your cluster to download images from Amazon ECR, make sure the instance profile for your cluster has the permissions from the AmazonEC2ContainerRegistryReadOnly managed policy associated with it. The configuration in the first step below configures your EMR 6.0.0 cluster to use Amazon ECR to download Docker images, and configures Apache Livy and Apache Spark to use the Docker image as the default Docker image for all Spark jobs. Complete the following steps to launch your cluster:

  1. Create a file named emr-configuration.json in the local directory with the following configuration (replace 123456789123.dkr.ecr.us-east-1.amazonaws.com with your Amazon ECR endpoint):

    You will use that configuration to launch your EMR 6.0.0 cluster using the AWS CLI.

  2. Enter the following commands (replace myKey with the name of the EC2 key pair you use to access the cluster using SSH, and subnet-1234567 with the subnet ID the cluster should be launched in):

    After the cluster launches and is in the Waiting state, make sure that the cluster hosts can authenticate themselves to Amazon ECR and download Docker images.

  3. Use your EC2 key pair to SSH into one of the core nodes of the cluster.
  4. To generate the Docker CLI command to create the credentials (valid for 12 hours) the cluster uses to download Docker images from Amazon ECR, enter the following command:
  5. Enter and run the output from the command:

    This command generates a config.json file in the .docker folder.

  6. Place the generated config.json file in HDFS using the following command:

Now that you have an EMR cluster that is configured with Docker and an image in Amazon ECR, you can use EMR Notebooks to create and run your notebook.

Creating an EMR Notebook

EMR Notebooks are serverless Jupyter notebooks available directly through the Amazon EMR console. They allow you to separate your notebook environment from your underlying cluster infrastructure, and access your notebook without spending time setting up SSH access or configuring your browser for port-forwarding. You can find EMR Notebooks in the left-hand navigation of the EMR console.

To create your notebook, complete the following steps:

  1. Click on Notebooks in the EMR Console
  2. Choose a name for your notebook
  3. Click Choose an existing cluster and select the cluster you just created
  4. Click Create notebook
  5. Once your notebook is in a Ready status, you can click the Open in JupyterLab button to open it in a new browser tab. A default notebook with the name of your EMR Notebook is created by default. When you click on that notebook, you’ll be asked to choose a Kernel. Choose PySpark.
  6. Enter the following configuration into the first cell in your notebook and click ▸(Run):
  7. Enter the following PySpark code into your notebook and click ▸(Run):

The output should look like the following screenshot; the numpy version in use is the latest (at the time of this writing, 1.18.2).


This PySpark code was run on your EMR 6.0.0 cluster using YARN, Docker, and the image that you created. EMR Notebooks connect to EMR clusters using Apache Livy. The configuration specified in emr-configuration.json configured your EMR cluster’s Spark and Livy instances to use Docker and the Docker image as the default Docker image for all Spark jobs submitted to this cluster. This allows you to use numpy without having to install it on any cluster nodes. The following section looks at how you can create and use different Docker images for specific notebooks.

Using a custom Docker image for a specific notebook

Individual workloads often require specific versions of library dependencies. To allow individual notebooks to use their own Docker images, you first create a new Docker image and push it to Amazon ECR. You then configure your notebook to use this Docker image instead of the default image.

Complete the following steps:

  1. Create a new Dockerfile with a specific version of numpy: 1.17.5.
  2. Create a directory and a new file named Dockerfile using the following commands:
  3. Enter the contents of your new Dockerfile and the following code to build a Docker image:
  4. Tag and upload the locally built image to Amazon ECR and replace 123456789123.dkr.ecr.us-east-1.amazonaws.com with your Amazon ECR endpoint using the following commands:

    Now that the numpy-1-17 Docker image is available in Amazon ECR, you can use it with a new notebook.

  1. Create a new Notebook by returning to your EMR Notebook and click File, New, Notebook, and choose the PySpark kernel. To tell your EMR Notebook to use this specific Docker image instead of the default, you need to use the following configuration parameters.
  1. Enter the following code in your notebook (replace 123456789123.dkr.ecr.us-east-1.amazonaws.com with your Amazon ECR endpoint) and choose ▸(Run):

    To check if your PySpark code is using version 1.17.5, enter the same PySpark code as before to use numpy and output the version.

  1. Enter the following code into your notebook and choose Run:

The output should look like the following screenshot; the numpy version in use is the version you installed in your Docker image: 1.17.5.


Summary

This post showed you how to simplify your Spark dependency management using Amazon EMR 6.0.0 and Docker. You created a Docker image to package your Python dependencies, created a cluster configured to use Docker, and used that Docker image with an EMR Notebook to run PySpark jobs. To find out more about using Docker images with EMR, refer to the EMR documentation on how to Run Spark Applications with Docker Using Amazon EMR 6.0.0. Stay tuned for additional updates on new features and further improvements with Apache Spark on Amazon EMR.

 


About the Author

Paul Codding is a senior product manager for EMR at Amazon Web Services.

Suthan Phillips is a big data architect at AWS. He works with customers to provide them architectural guidance and helps them achieve performance enhancements for complex applications on Amazon EMR. In his spare time, he enjoys hiking and exploring the Pacific Northwest.

from AWS Big Data Blog: https://aws.amazon.com/blogs/big-data/simplify-your-spark-dependency-management-with-docker-in-emr-6-0-0/

Source: https://awsfeed.com/whats-new/big-data/simplify-your-spark-dependency-management-with-docker-in-emr-6-0-0


How to simplify Spark dependency management with Docker in EMR 6.x?

Today, PySpark users install PySpark dependencies on each host in an EMR cluster (usually with custom scripts or bootstrap actions). This increases complexity and causes many issues when the environment is multi-tenant or when you have to run different library versions for different jobs (e.g., machine learning or deep learning applications). Now, we can use Docker/YARN (from Hadoop 3.x) to manage Spark dependencies. Application dependencies can now be scoped to each individual application. The dependencies are expressed in a Dockerfile and in the corresponding Docker image.

EMR 6.x is ready by default to run PySpark inside a Docker container.

EMR 6 supports public registries like Docker Hub and private registries like Amazon ECR repositories. Each node of the cluster will download the built Docker image from your registry.

For Amazon EMR 6.x, the following Docker registries can be configured: Docker Hub and Amazon ECR.

The base Spark and Hadoop libraries and related configurations installed on the cluster are distributed automatically to all of the Spark hosts in the cluster using AWS EMR features, and are mounted into the Docker containers automatically by YARN.

In addition, any binaries (--files, --jars, etc.) explicitly included by the user when the application is submitted are also made available via the distributed cache of Hadoop 3.

In the "classic" distributed application YARN cluster mode, a user (or a Step Function) submits a Spark job to be executed, which is scheduled and executed by YARN. The ApplicationMaster hosts the Spark driver, which is launched on the cluster in a Docker container (the executors are in Docker containers).

Spark containers on YARN (cluster mode - from Cloudera documentation)

From pyspark-latest/Dockerfile we will build our docker image. You can build the image from your local laptop or (I prefer) use an EC2 instance.

You can use a private docker repository or a public registry.

For example, It is easy to create an ECR repository in your AWS account:

Take note of the result of this command:

Now, we will tag our locally built Docker image with our AWS ECR endpoint:

Get the login line for our AWS ECR account:

And run the output of the command (for example):

Now, we can push our locally built image to AWS ECR:

After this push is complete, the Docker image is available to use with your EMR cluster.

First of all we need to create a configuration file for our AWS EMR. We have two options:

  • copy the content of the file emr-configuration.json and paste it inside the CLI
  • create a local file

Remember to check the content of the file and replace the registry ID with your registry ID.

The config file contains docker.trusted.registries and docker.privileged-containers.registries, that is, the list of trusted Docker registries.

To execute a Spark application in a Docker container, the following configuration options are necessary:

When using Amazon ECR to retrieve Docker images, you must configure the cluster to authenticate itself. To do so, you must use the following configuration option:

For EMR 6.1.0, you are not required to use the listed command YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG={DOCKER_CLIENT_CONFIG_PATH_ON_HDFS} when the ECR auto authentication feature is enabled.

After the configuration of the file we can create the cluster with the AWS CLI (for this example we will use a small cluster with 3 nodes: 1 master and 2 cores: the docker integration also works with task instances):

Configure ECR credentials on AWS EMR 6.0.0 (or when IAM Role doesn't have ecr:GetAuthorizationToken)

AWS EMR 6.0.0 doesn't directly support auto-configuration for the EMR and ECR integration. So, we have to configure it manually by creating a config file on HDFS.

To add the config file, wait until the cluster is in the Waiting state, then use your EC2 key pair to SSH into one of the core nodes of the cluster.

(note: if connection is not allowed, check the security group of the instance and enable SSH Port for your ip address)

Now we can get the credentials for AWS ECR. These credentials will be used by AWS EMR to download the built image.

aws ecr get-login --region us-east-1 --no-include-email --profile yourprofile

Run the output command (like this):

This command generates a config.json file. Put this file on HDFS:

Now YARN can access ECR as a Docker image registry and pull containers during job execution.

EMR 6.1.0 supports automatic authentication to Amazon ECR. If an Amazon ECR registry is detected in the container-executor classification key, the Amazon ECR auto authentication feature activates and YARN handles the authentication process when you submit a Spark job with an ECR image. Automatic authentication is enabled and the YARN authentication setting is set to true if docker.trusted.registries contains an ECR registry URL.

Prerequisites for using automatic authentication to Amazon ECR:

  • EMR version 6.1.0 or later
  • ECR registry included in configuration is in the same Region with the cluster
  • IAM role with permissions to get authorization token and pull any image (if you are using EMR_Default_Role, this role has to be enabled to ecr:GetAuthorizationToken)

Now, we can create an EMR Notebook to test the solution.

To create a notebook, follow these steps:

  1. Connect to AWS Console, and choose the EMR section
  2. Click on Notebooks in the EMR Console (in the left side bar of the screen)
  3. Choose a name for your notebook
  4. Click Choose an existing cluster and select the cluster you created few minutes ago
  5. Click on Create notebook button

When the Notebook is in the Ready state, we can use the Open in JupyterLab button to access the notebooks. From the New button we can create a notebook with the PySpark kernel.

In the first cell, just configure the session:

In the second cell use the script from emrdocker.py

The output should look like the following screenshot; the numpy version in use is the latest.

Note: remember to stop the EMR Notebook and the cluster at the end of your work ;)

To run a job (for example add a step to the cluster) you need to specify the yarn container runtime:

To use Docker Hub or your custom private registry, note:

  • you must deploy your cluster to a public subnet
  • configure it to use Docker Hub as a trusted registry
Source: https://github.com/mauropelucchi/aws-emr-docker-integration

Post Syndicated from Paul Codding original https://aws.amazon.com/blogs/big-data/run-spark-applications-with-docker-using-amazon-emr-6-0-0-beta/

The Amazon EMR team is excited to announce the public beta release of EMR 6.0.0 with Spark 2.4.3, Hadoop 3.1.0, Amazon Linux 2, and Amazon Corretto 8. With this beta release, Spark users can use Docker images from Docker Hub and Amazon Elastic Container Registry (Amazon ECR) to define environment and library dependencies. Using Docker, users can easily define their dependencies and use them for individual jobs, avoiding the need to install dependencies on individual cluster hosts.

This post shows you how to use Docker with the EMR release 6.0.0 Beta. You’ll learn how to launch an EMR release 6.0.0 Beta cluster and run Spark jobs using Docker containers from both Docker Hub and Amazon ECR.

Hadoop 3 Docker support

EMR 6.0.0 (Beta) includes Hadoop 3.1.0, which allows the YARN NodeManager to launch containers either directly on the host machine of the cluster, or inside a Docker container. Docker containers provide a custom execution environment in which the application’s code runs isolated from the execution environment of the YARN NodeManager and other applications.

These containers can include special libraries needed by the application, and even provide different versions of native tools and libraries such as R, Python, Python libraries. This allows you to easily define the libraries and runtime dependencies that your applications need, using familiar Docker tooling.

Clusters running the EMR 6.0.0 (Beta) release are configured by default to allow YARN applications such as Spark to run using Docker containers. To customize this, use the configuration for Docker support defined in the yarn-site.xml and container-executor.cfg files available in the /etc/hadoop/conf directory. For details on each configuration option and how it is used, see Launching Applications Using Docker Containers.

You can choose to use Docker when submitting a job. On job submission, the YARN_CONTAINER_RUNTIME_TYPE and YARN_CONTAINER_RUNTIME_DOCKER_IMAGE variables are used to specify the Docker runtime and Docker image used:

When you use Docker containers to execute your YARN applications, YARN downloads the Docker image specified when you submit your job. For YARN to resolve this Docker image, it must be configured with a Docker registry. Options to configure a Docker registry differ based on how you chose to deploy EMR (using either a public or private subnet).

Docker registries

A Docker registry is a storage and distribution system for Docker images. For EMR 6.0.0 (Beta), the following Docker registries can be configured:

  • Docker Hub: A public Docker registry containing over 100,000 popular Docker images.
  • Amazon ECR: A fully-managed Docker container registry that allows you to create your own custom images and host them in a highly available and scalable architecture.

Deployment considerations

Docker registries require network access from each host in the cluster, as each host downloads images from the Docker registry when your YARN application is running on the cluster. How you choose to deploy your EMR cluster (launching it into a public or private subnet) may limit your choice of Docker registry due to network connectivity requirements.

Public subnet

With EMR public subnet clusters, nodes running YARN NodeManager can directly access any registry available over the internet, such as Docker Hub, as shown in the following diagram.

Private Subnet

With EMR private subnet clusters, nodes running YARN NodeManager don’t have direct access to the internet.  Docker images can be hosted in the ECR and accessed through AWS PrivateLink, as shown in the following diagram.

For details on how to use AWS PrivateLink to allow access to ECR in a private subnet scenario, see Setting up AWS PrivateLink for Amazon ECS, and Amazon ECR.

Configuring Docker registries

Docker must be configured to trust the specific registry used to resolve Docker images. The default trust registries are local (private) and centos (on public Docker Hub). You can override container-executor.cfg to use other public repositories or ECR. To override this configuration, use the EMR Classification API with the container-executor classification key.

The following example shows how to configure the cluster to trust both a public repository (your-public-repo) and an ECR registry (123456789123.dkr.ecr.us-east-1.amazonaws.com). When using ECR, replace this endpoint with your specific ECR endpoint.  When using Docker Hub, please replace this repository name with your actual repository name.

To launch an EMR 6.0.0 (Beta) cluster with this configuration using the AWS Command Line Interface (AWS CLI), create a file with the contents of the preceding JSON configuration.  Then, use the following commands to launch the cluster:

Using ECR

If you’re new to ECR, follow the instructions in Getting Started with Amazon ECR and verify you have access to ECR from each instance in your EMR cluster.

To access ECR using the docker command, you must first generate credentials. To make sure that YARN can access images from ECR, pass a reference to those generated credentials using the container environment variable YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG.

Run the following command on one of the core nodes to get the login line for your ECR account.

The command generates the correct Docker CLI command to run to create credentials. Copy and run the output from that command.

This command generates a config.json file in the /root/.docker folder.  Copy this file to HDFS so that jobs submitted to the cluster can use it to authenticate to ECR.

Execute the commands below to copy the file to your home directory.



Execute the commands below to put the in HDFS so it may be used by jobs running on the cluster.

At this point, YARN can access ECR as a Docker image registry and pull containers during job execution.

Using Spark with Docker

With EMR 6.0.0 (Beta), Spark applications can use Docker containers to define their library dependencies, instead of requiring dependencies to be installed on the individual Amazon EC2 instances in the cluster. This integration requires configuration of the Docker registry, and definition of additional parameters when submitting a Spark application.

When the application is submitted, YARN invokes Docker to pull the specified Docker image and run the Spark application inside of a Docker container. This allows you to easily define and isolate dependencies. It reduces the time spent bootstrapping or preparing instances in the EMR cluster with the libraries needed for job execution.

When using Spark with Docker, make sure that you consider the following:

  • The Docker package and CLI are only installed on core and task nodes.
  • The spark-submit command should always be run from a master instance on the EMR cluster.
  • The Docker registries used to resolve Docker images must be defined using the Classification API with the container-executor classification key to define additional parameters when launching the cluster:
  • To execute a Spark application in a Docker container, the following configuration options are necessary:
  • When using ECR to retrieve Docker images, you must configure the cluster to authenticate itself. To do so, you must use the following configuration option:
  • Mount the /etc/passwd file into the container so that the user running the job can be identified in the Docker container (see the sketch after this list).
  • Any Docker image used with Spark must have Java installed in the Docker image.
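For the /etc/passwd mount mentioned above, the setting rides on the same per-job container-runtime variables as the other options; the sketch below assumes the usual source:destination:mode form with a read-only mount.

```bash
# Passed per job via --conf spark.executorEnv.<VAR> and --conf spark.yarn.appMasterEnv.<VAR>.
YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro
```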

Creating a Docker image

Docker images are created using a Dockerfile, which defines the packages and configuration to include in the image.  The following two example Dockerfiles use PySpark and SparkR.

PySpark Dockerfile

Docker images created from this Dockerfile include Python 3 and the numpy Python package.  This Dockerfile uses Amazon Linux 2 and the Amazon Corretto JDK 8.

SparkR Dockerfile

Docker images created from this Dockerfile include R and the randomForest CRAN package. This Dockerfile includes Amazon Linux 2 and the Amazon Corretto JDK 8.

For more information on Dockerfile syntax, see the Dockerfile reference documentation.

Using Docker images from ECR

Amazon Elastic Container Registry (ECR) is a fully-managed Docker container registry that makes it easy for developers to store, manage, and deploy Docker container images. When using ECR, the cluster must be configured to trust your instance of ECR, and you must configure authentication in order for the cluster to use Docker images from ECR.

In this example, our cluster must be created with the following additional configuration, to ensure the ECR registry is trusted. Please replace the 123456789123.dkr.ecr.us-east-1.amazonaws.com endpoint with your ECR endpoint.

Using PySpark with ECR

This example uses the PySpark Dockerfile. The image built from it will be tagged and uploaded to ECR. Once uploaded, you will run the PySpark job and reference the Docker image from ECR.

After you launch the cluster, use SSH to connect to a core node and run the following commands to build the local Docker image from the PySpark Dockerfile example.

First, create a directory and a Dockerfile for our example.

Paste the contents of the PySpark Dockerfile and run the following commands to build a Docker image.

Create the ECR repository for our examples.

Tag and upload the locally built image to ECR, replacing the example registry endpoint with your ECR endpoint.


Use SSH to connect to the master node and prepare a Python script with a filename such as main.py. Paste the following content into the file and save it.

To submit the job, reference the name of the Docker image. Define the additional configuration parameters to make sure that the job execution uses Docker as the runtime. When using ECR, the YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG must reference the file containing the credentials used to authenticate to ECR.

When the job has completed, take note of the YARN application ID, and use the following command to obtain the output of the PySpark job.

Using SparkR with ECR

This example uses the SparkR Dockerfile. The image built from it will be tagged and uploaded to ECR. Once uploaded, you will run the SparkR job and reference the Docker image from ECR.

After you launch the cluster, use SSH to connect to a core node and run the following commands to build the local Docker image from the SparkR Dockerfile example.

First, create a directory and the Dockerfile for this example.

Paste the contents of the SparkR Dockerfile and run the following commands to build a Docker image.

Tag and upload the locally built image to ECR, replacing the example registry endpoint with your ECR endpoint.


Use SSH to connect to the master node and prepare an R script with a name such as sparkR.R. Paste the following contents into the file.

To submit the job, reference the name of the Docker image. Define the additional configuration parameters to make sure that the job execution uses Docker as the runtime. When using ECR, the YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG must reference the file containing the credentials used to authenticate to ECR.

When the job has completed, note the YARN application ID, and use the following command to obtain the output of the SparkR job. This example includes testing to make sure that the randomForest library, its installed version, and its release notes are available.

Using a Docker image from Docker Hub

To use Docker Hub, you must deploy your cluster to a public subnet, and configure it to use Docker Hub as a trusted registry. In this example, the cluster needs the following additional configuration to make sure that the your-public-repo repository on Docker Hub is trusted. When using Docker Hub, please replace this repository name with your actual repository.

Beta limitations

EMR 6.0.0 (Beta) focuses on helping you get value from using Docker with Spark to simplify dependency management. You can also use EMR 6.0.0 (Beta) to get familiar with Amazon Linux 2, and Amazon Corretto JDK 8.

The EMR 6.0.0 (Beta) supports the following applications:

  • Spark 2.4.3
  • Livy 0.6.0
  • ZooKeeper 3.4.14
  • Hadoop 3.1.0

This beta release is supported in the following Regions:

  • US East (N. Virginia)
  • US West (Oregon)

The following EMR features are currently not available with this beta release:

  • Cluster integration with AWS Lake Formation
  • Native encryption of Amazon EBS volumes attached to an EMR cluster

Conclusion

In this post, you learned how to use an EMR 6.0.0 (Beta) cluster to run Spark jobs in Docker containers and integrate with both Docker Hub and ECR. You’ve seen examples of both PySpark and SparkR Dockerfiles.

The EMR team looks forward to hearing about how you’ve used this integration to simplify dependency management in your projects. If you have questions or suggestions, feel free to leave a comment.


About the Authors

Paul Codding is a senior product manager for EMR at Amazon Web Services.

Ajay Jadhav is a software development engineer for EMR at Amazon Web Services.

Rentao Wu is a software development engineer for EMR at Amazon Web Services.

Stephen Wu is a software development engineer for EMR at Amazon Web Services.
Source: https://noise.getoto.net/2019/09/06/run-spark-applications-with-docker-using-amazon-emr-6-0-0-beta/


Accessing S3 on AWS EMR when running application within Docker container via spark-submit

I'm running an application that accesses data in an AWS S3 folder. The following script serves as a simple proxy for my application that reproduces my error:

Simple enough, and it doesn't fail when I run the following command...

... on an EMR cluster.

When I, however, try running the same application within a Docker container as described here in the documentation, I get the following error:

The exact command I use is the one in the above documentation link, but with values relevant to my case:

I can confirm that the error has nothing to do with any configuration or access management matter, as my script runs and the error pops up at the call to the CSV read function.

It appears to me that I can solve this by resolving the EMRFS dependencies within the container, or directly at submit time, but I haven't been able to figure out how, or what exactly these dependencies are.

asked Jul 31 '20 at 10:01 by user1953384

Source: https://stackoverflow.com/questions/63189713/accessing-s3-on-aws-emr-when-running-application-within-docker-container-via-spa

