simonslav

Reputation: 13

Submitting pyspark supporting sql files inside zip file on AWS EMR

I am looking to reference non-python files (e.g., SQL, config, txt) saved as .zip on S3 in my pyspark application on Amazon EMR. I have tried --py-files, but that only worked with my python files. I am still unable to use my zipped SQL/config files from S3 in Amazon EMR. Does anyone have any solutions to this?

Upvotes: 1

Views: 1477

Answers (1)

Ryan Widmaier

Reputation: 8513

The flag you are looking for is --archives. Basically, you give it a zip file and it will be extracted into the directory each YARN container executes in. You should be able to access the files using relative paths in your script.

You can also control the name of the folder your zip is unzipped to by appending #{name} to the end. For example, --archives s3://aaa/some.zip#files. Spark only mentions this in passing here:

https://spark.apache.org/docs/latest/running-on-yarn.html#important-notes
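Putting those pieces together, a submission might look like this (the bucket, archive, and script names are placeholders, not from your setup):

```shell
# Sketch only: s3://my-bucket/resources.zip and main.py are hypothetical names.
# YARN unpacks resources.zip into a folder called "files" in each
# container's working directory, thanks to the #files suffix.
spark-submit \
  --deploy-mode cluster \
  --archives s3://my-bucket/resources.zip#files \
  main.py
```

Your script can then open e.g. files/queries.sql with a plain relative path.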

One thing to be aware of: if you are running your job with --deploy-mode client, then your driver is not running in a YARN container and therefore will not have access to the extracted files. You will instead want to use --deploy-mode cluster.
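To see what the container-side access pattern looks like, here is a minimal local sketch. The archive name, alias, and SQL file are hypothetical; the point is that once YARN has unpacked some.zip#files into the working directory, the script reads the contents with an ordinary relative path. The zip creation and extraction below only simulate what YARN does for you.

```python
import os
import tempfile
import zipfile

# Simulate a container working directory.
workdir = tempfile.mkdtemp()

# Build a stand-in for s3://aaa/some.zip containing one SQL file.
zip_path = os.path.join(workdir, "some.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("queries.sql", "SELECT 1 AS answer")

# With --archives s3://aaa/some.zip#files, YARN would unpack the
# archive into a folder named "files"; we mimic that here.
extract_dir = os.path.join(workdir, "files")
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(extract_dir)

os.chdir(workdir)  # pretend this is the container's working directory

# Inside your pyspark script, this is all you need:
with open("files/queries.sql") as f:
    sql = f.read()

print(sql)  # SELECT 1 AS answer
```

The same relative path works on every executor, since each container gets its own extracted copy of the archive.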

Upvotes: 1

Related Questions