Reputation: 13
I am looking to reference non-Python files (e.g., SQL, config, txt) saved as a .zip on S3 from my PySpark application on Amazon EMR. I have tried --py-files, but that only worked for my Python files; I am still unable to use the zipped SQL/config files from S3 on Amazon EMR. Does anyone have a solution?
Upvotes: 1
Views: 1477
Reputation: 8513
The flag you are looking for is --archives. Basically, you give it a zip file and it will extract it into the directory each YARN container is executing in, so you should be able to access the files using relative paths in your script.
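As a minimal sketch (the bucket, archive, and file names are made up), the job can then read an extracted file with a plain relative path; without an alias, the extracted contents typically appear under the archive's own file name:

```python
# Submitted with (hypothetical names):
#   spark-submit --archives s3://my-bucket/deps.zip my_job.py
# YARN unpacks the zip into each container's working directory,
# exposed under the archive's file name, so a relative path works.
with open("deps.zip/queries/report.sql") as f:
    query = f.read()
```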
You can also control the name of the folder your zip is unzipped to by adding #{name} to the end, for example --archives s3://aaa/some.zip#files. Spark only mentions this in passing here:
https://spark.apache.org/docs/latest/running-on-yarn.html#important-notes
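With the alias, the relative path in the job uses the folder name you chose; a minimal sketch of the reading side (the file inside the zip is an assumption):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("archives-demo").getOrCreate()

# Submitted with --archives s3://aaa/some.zip#files, so the zip's
# contents appear under a folder named "files" in the container's
# working directory. report.sql is a hypothetical file in the zip.
with open("files/report.sql") as f:
    query = f.read()

spark.sql(query).show()
```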
One thing to be aware of: if you are running your job with --deploy-mode client, then your driver is not running in a YARN container and therefore will not have access to the files. You will instead want to use --deploy-mode cluster.
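So a full submission might look like this (bucket and script names are placeholders):

```
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --archives s3://aaa/some.zip#files \
  my_job.py
```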
Upvotes: 1