Andrey

Reputation: 6367

How to upload files to Amazon EMR?

My code is as follows:

# test2.py

from pyspark import SparkContext, SparkConf, SparkFiles
conf = SparkConf()
sc = SparkContext(
    appName="test",
    conf=conf)
from pyspark.sql import SQLContext
sqlc = SQLContext(sparkContext=sc)
with open(SparkFiles.get("test_warc.txt")) as f:
  print("opened")
sc.stop()

It works when I run it locally with:

spark-submit --deploy-mode client --files ../input/test_warc.txt test2.py

But after adding a step to the EMR cluster:

spark-submit --deploy-mode cluster --files s3://brand17-stock-prediction/test_warc.txt s3://brand17-stock-prediction/test2.py

I am getting error:

FileNotFoundError: [Errno 2] No such file or directory: '/mnt1/yarn/usercache/hadoop/appcache/application_1618078674774_0001/spark-e7c93ba0-7d30-4e52-8f1b-14dda6ff599c/userFiles-5bb8ea9f-189d-4256-803f-0414209e7862/test_warc.txt'

The path to the file is correct, but it is not being uploaded from S3 for some reason.

I tried to load from executor:

from pyspark import SparkContext, SparkConf, SparkFiles
from operator import add

conf = SparkConf()
sc = SparkContext(
    appName="test",
    conf=conf)
from pyspark.sql import SQLContext
sqlc = SQLContext(sparkContext=sc)
def f(_):
    a = 0
    with open(SparkFiles.get("test_warc.txt")) as f:
      a += 1
      print("opened")
    return a
count = sc.parallelize(range(1, 3), 2).map(f).reduce(add)
print(count) # printing 2

sc.stop()

And it works without errors.

It looks like the --files argument uploads files to the executors only. How can I upload the file to the master node?

Upvotes: 2

Views: 1365

Answers (1)

Ajay Kr Choudhary

Reputation: 1352

Your understanding is correct.

The --files argument uploads files to the executors only.

See this in the Spark documentation:

file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file server, and every executor pulls the file from the driver HTTP server.

You can read more about this at advanced-dependency-management

Now coming back to your second question:

How can I upload to the master?

There is a concept of a bootstrap action in EMR. The official documentation describes it as follows:

You can use a bootstrap action to install additional software or customize the configuration of cluster instances. Bootstrap actions are scripts that run on cluster after Amazon EMR launches the instance using the Amazon Linux Amazon Machine Image (AMI). Bootstrap actions run before Amazon EMR installs the applications that you specify when you create the cluster and before cluster nodes begin processing data.

How do I use it in my case?

While spawning the cluster, you can specify your script in the BootstrapActions JSON, along with your other custom configuration. Something like the following:

BootstrapActions=[
            {'Name': 'Setup Environment for downloading my script',
             'ScriptBootstrapAction':
                 {
                     'Path': 's3://your-bucket-name/path-to-custom-scripts/download-file.sh'
                 }
             }]
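For context, here is a minimal sketch of how that fragment fits into a boto3 `run_job_flow` call. The bucket name, script path, and cluster settings are placeholders, and the actual API call is left commented out because it needs valid AWS credentials:

```python
# Bootstrap action config -- bucket and script path are placeholders.
bootstrap_actions = [
    {
        'Name': 'Setup Environment for downloading my script',
        'ScriptBootstrapAction': {
            'Path': 's3://your-bucket-name/path-to-custom-scripts/download-file.sh'
        },
    }
]

# Sketch of the cluster-creation call (requires AWS credentials):
# import boto3
# emr = boto3.client('emr', region_name='us-east-1')
# response = emr.run_job_flow(
#     Name='my-cluster',
#     ReleaseLabel='emr-6.2.0',
#     Applications=[{'Name': 'Spark'}],
#     Instances={'InstanceCount': 3, 'MasterInstanceType': 'm5.xlarge',
#                'SlaveInstanceType': 'm5.xlarge'},
#     BootstrapActions=bootstrap_actions,
#     JobFlowRole='EMR_EC2_DefaultRole',
#     ServiceRole='EMR_DefaultRole',
# )
```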

The content of download-file.sh should look something like this:

#!/bin/bash
set -x
workingDir=/opt/your-path/
sudo mkdir -p $workingDir
sudo aws s3 cp s3://your-bucket/path-to-your-file/test_warc.txt $workingDir

Now in your Python script, you can read the file from /opt/your-path/test_warc.txt (the $workingDir set in the bootstrap script).
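A small sketch of that driver-side read, assuming the bootstrap script above has already staged the file under /opt/your-path/ (the directory and helper name are illustrative, not part of any EMR API):

```python
import os

# Directory the (hypothetical) bootstrap script copied the file into.
WORKING_DIR = "/opt/your-path"

def read_local_file(name, working_dir=WORKING_DIR):
    """Read a file that a bootstrap action staged on this node."""
    path = os.path.join(working_dir, name)
    with open(path) as f:
        return f.read()

# On the EMR master this would be:
# text = read_local_file("test_warc.txt")
```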

There is also an option to execute your bootstrap action on the master node only, on task nodes only, or on a mix of both; bootstrap-actions/run-if is the script we can use for this. More reading on this can be done at emr-bootstrap-runif.
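For example, to run the download only on the master node, the run-if helper can wrap the copy command. This is a sketch in the same dict style as above; the run-if path follows the AWS examples, so verify the script location for your region:

```python
# Sketch: run the copy command only where instance.isMaster is true.
bootstrap_actions = [
    {
        'Name': 'Download file on master only',
        'ScriptBootstrapAction': {
            'Path': 's3://elasticmapreduce/bootstrap-actions/run-if',
            'Args': [
                'instance.isMaster=true',
                'aws s3 cp s3://your-bucket/path-to-your-file/test_warc.txt /opt/your-path/',
            ],
        },
    }
]
```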

Upvotes: 1
