Reputation: 6367
My code is as follows:
# test2.py
from pyspark import SparkContext, SparkConf, SparkFiles

conf = SparkConf()
sc = SparkContext(
    appName="test",
    conf=conf)

from pyspark.sql import SQLContext
sqlc = SQLContext(sparkContext=sc)

with open(SparkFiles.get("test_warc.txt")) as f:
    print("opened")

sc.stop()
It works when I run it locally with:
spark-submit --deploy-mode client --files ../input/test_warc.txt test2.py
But after adding a step to the EMR cluster:
spark-submit --deploy-mode cluster --files s3://brand17-stock-prediction/test_warc.txt s3://brand17-stock-prediction/test2.py
I am getting this error:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt1/yarn/usercache/hadoop/appcache/application_1618078674774_0001/spark-e7c93ba0-7d30-4e52-8f1b-14dda6ff599c/userFiles-5bb8ea9f-189d-4256-803f-0414209e7862/test_warc.txt'
The path to the file is correct, but it is not being uploaded from S3 for some reason.
I tried loading it from an executor:
from pyspark import SparkContext, SparkConf, SparkFiles
from operator import add

conf = SparkConf()
sc = SparkContext(
    appName="test",
    conf=conf)

from pyspark.sql import SQLContext
sqlc = SQLContext(sparkContext=sc)

def f(_):
    a = 0
    with open(SparkFiles.get("test_warc.txt")) as f:
        a += 1
        print("opened")
    return a  # test_module.test()

count = sc.parallelize(range(1, 3), 2).map(f).reduce(add)
print(count)  # printing 2
sc.stop()
And it works without errors.
It looks like the --files argument uploads files to the executors only. How can I upload the file to the master?
Upvotes: 2
Views: 1365
Reputation: 1352
Your understanding is correct: the --files argument uploads files to the executors only.
See this in the Spark documentation:
file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file server, and every executor pulls the file from the driver HTTP server.
You can read more about this at advanced-dependency-management
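To illustrate the mechanism described in that quote, here is a minimal sketch (using the same s3:// path as in your question, and assuming the cluster's Hadoop configuration can read s3:// URIs): SparkContext.addFile() is the programmatic counterpart of --files, and SparkFiles.get() resolves the local copy on whichever node actually pulled the file.
from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="files-demo")
# Programmatic equivalent of passing --files to spark-submit.
sc.addFile("s3://brand17-stock-prediction/test_warc.txt")

def read_on_executor(_):
    # Each executor pulls its own local copy from the driver's file server.
    with open(SparkFiles.get("test_warc.txt")) as f:
        return 1

print(sc.parallelize(range(2), 2).map(read_on_executor).sum())  # 2
sc.stop()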
Now, coming back to your second question:
How can I upload to master?
There is a concept of a bootstrap action in EMR. From the official documentation, it means the following:
You can use a bootstrap action to install additional software or customize the configuration of cluster instances. Bootstrap actions are scripts that run on your cluster after Amazon EMR launches the instance using the Amazon Linux Amazon Machine Image (AMI). Bootstrap actions run before Amazon EMR installs the applications that you specify when you create the cluster and before cluster nodes begin processing data.
How do I use it in my case?
While spawning the cluster, you can specify your script in the BootstrapActions JSON, something like the following, along with other custom configurations:
BootstrapActions=[
    {
        'Name': 'Setup Environment for downloading my script',
        'ScriptBootstrapAction': {
            'Path': 's3://your-bucket-name/path-to-custom-scripts/download-file.sh'
        }
    }
]
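For context, this is roughly where that block sits in a full boto3 run_job_flow call. Everything below (cluster name, release label, instance types, roles) is a placeholder sketch, not taken from your setup:
import boto3

emr = boto3.client('emr')
response = emr.run_job_flow(
    Name='my-cluster',
    ReleaseLabel='emr-6.2.0',
    Applications=[{'Name': 'Spark'}],
    Instances={
        'MasterInstanceType': 'm5.xlarge',
        'SlaveInstanceType': 'm5.xlarge',
        'InstanceCount': 3,
        'KeepJobFlowAliveWhenNoSteps': False,
    },
    # The bootstrap action shown above; by default it runs on every node.
    BootstrapActions=[
        {
            'Name': 'Setup Environment for downloading my script',
            'ScriptBootstrapAction': {
                'Path': 's3://your-bucket-name/path-to-custom-scripts/download-file.sh'
            }
        }
    ],
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
print(response['JobFlowId'])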
The content of download-file.sh should look something like this:
#!/bin/bash
set -x

# Create a fixed local directory on the node and copy the input file
# into it from S3, so any script on the node can find it there.
workingDir=/opt/your-path/
sudo mkdir -p $workingDir
sudo aws s3 cp s3://your-bucket/path-to-your-file/test_warc.txt $workingDir
Now, in your Python script, you can read the file directly from $workingDir/test_warc.txt (i.e. /opt/your-path/test_warc.txt).
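For example, the driver script from your question could become something like the following sketch, assuming the bootstrap action above ran on every node (the default) and placed the file at /opt/your-path/:
# test2.py -- reading the file copied by the bootstrap action
from pyspark import SparkContext, SparkConf

conf = SparkConf()
sc = SparkContext(appName="test", conf=conf)

# A plain local open; no SparkFiles.get() needed, because the bootstrap
# action put the file at a fixed path before Spark started.
with open("/opt/your-path/test_warc.txt") as f:
    print("opened")

sc.stop()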
There is also an option to execute your bootstrap action on the master node only, on task nodes only, or on a mix of both; bootstrap-actions/run-if is the script we can use for this case. More reading on this can be done at emr-bootstrap-runif.
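As a sketch, a conditional bootstrap action that downloads the file on the master node only could look like the following. The run-if script path is the one AWS publishes (regional buckets such as s3://us-east-1.elasticmapreduce/... also exist); the bucket and destination path are placeholders:
BootstrapActions=[
    {
        'Name': 'Download file on the master node only',
        'ScriptBootstrapAction': {
            # AWS-provided run-if helper: the first argument is the condition,
            # the second is the command to run when the condition is true.
            'Path': 's3://elasticmapreduce/bootstrap-actions/run-if',
            'Args': [
                'instance.isMaster=true',
                'aws s3 cp s3://your-bucket/path-to-your-file/test_warc.txt /opt/your-path/'
            ]
        }
    }
]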
Upvotes: 1