Reputation: 376
I'm new to Python and am trying to run my PySpark project on Spark on AWS EMR. The project is stored on AWS S3 and consists of several Python files, laid out like this:
/folder1
- main.py
/utils
- utils1.py
- utils2.py
I use the following command:
spark-submit --py-files s3://bucket/utils s3://bucket/folder1/main.py
But I get the error:
Traceback (most recent call last):
File "/mnt/tmp/spark-1e38eb59-3ddd-4deb-8529-eace7465b6ce/main.py", line 15, in <module>
from utils.utils1 import foo
ModuleNotFoundError: No module named 'utils'
What do I have to fix in my command? I know that I can pack my project into a zip file, but for now I need to do it without packing. Still, I would be grateful if you showed me both solutions.
UPD:
The EMR cluster's controller log says that the launch command looks like this:
hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --packages org.apache.spark:spark-avro_2.12:3.1.1 --driver-memory 100G --conf spark.driver.maxResultSize=100G --conf spark.hadoop.fs.s3.maxRetries=20 --conf spark.sql.sources.partitionOverwriteMode=dynamic --py-files s3://bucket/dir1/dir2/utils.zip --master yarn s3://bucket/dir1/dir2/dir3/main.py --args
But now I get the following error:
java.io.FileNotFoundException: File file:/mnt/var/lib/hadoop/steps/cluster-id/dir1/dir2/utils.zip does not exist
What's wrong?
Upvotes: 1
Views: 2532
Reputation: 20445
Although it is not recommended (see the rest of this answer for a better option), if you do not want to zip the files, you can pass the individual utils files to --py-files as a comma-separated list, placed before the main script, instead of passing the utils folder:
'Args': ['spark-submit',
'--py-files',
'{your_s3_path_here}/utils/utils1.py,{your_s3_path_here}/utils/utils2.py',
'main.py']
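One consequence worth checking: files passed individually end up as top-level modules on the Python path, so main.py would most likely need `from utils1 import foo` rather than `from utils.utils1 import foo`. The sketch below simulates this locally, without Spark. It writes a stand-in `utils1.py` to a temporary directory (the body of `foo` is an assumption made up for illustration), puts that directory on `sys.path`, and imports the module the flat way.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Local stand-in for what --py-files does with a single .py file:
# the file lands in a directory that is on the Python path.
workdir = Path(tempfile.mkdtemp())
(workdir / "utils1.py").write_text("def foo():\n    return 'foo from utils1'\n")

sys.path.insert(0, str(workdir))
importlib.invalidate_caches()

# The module is importable by its bare file name; there is no "utils" package here.
from utils1 import foo

result = foo()
print(result)
```

If main.py has to keep the `utils.utils1` import path, the zipped-package approach is the one that preserves it.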
Better: zip the utils folder
You can zip utils and include it as follows.
First, create an empty __init__.py file at the root level of utils (that is, utils/__init__.py).
Then, from outside this directory, make a zip of it (for example, utils.zip).
When submitting, pass this zip as follows:
'Args': ['spark-submit',
'--py-files',
'{your_s3_path_here}/utils.zip',
'main.py']
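For context, an 'Args' list like this normally sits inside an EMR step definition that runs spark-submit through command-runner.jar. The following is a minimal sketch of such a step as a plain Python dict. The helper name `build_spark_step`, the step name, and the `ActionOnFailure` value are assumptions chosen for illustration, and the S3 URIs are placeholders.

```python
def build_spark_step(py_files_uri, main_uri, name="pyspark-job"):
    """Return one EMR step definition that runs spark-submit via command-runner.jar."""
    return {
        "Name": name,                     # hypothetical step name
        "ActionOnFailure": "CONTINUE",    # assumed policy; pick what suits the cluster
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--py-files", py_files_uri,  # must come before the main script
                main_uri,
            ],
        },
    }


step = build_spark_step(
    "s3://bucket/utils.zip",        # placeholder URI
    "s3://bucket/folder1/main.py",  # placeholder URI
)
print(step["HadoopJarStep"]["Args"])
```

With boto3, a dict of this shape can be submitted with `boto3.client("emr").add_job_flow_steps(JobFlowId=..., Steps=[step])`, where the cluster id is whatever your cluster uses.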
This assumes that utils.zip contains the utils folder with __init__.py, utils1.py, and utils2.py inside it.
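The layout inside the archive matters: the utils folder itself has to be the top-level entry, otherwise `from utils.utils1 import ...` cannot resolve. As a rough sketch of how this could be built and checked locally, the code below creates a throwaway utils package (the function bodies are made up for illustration), zips it with a hypothetical helper `zip_package`, and imports from the archive through `sys.path`, which is how Python resolves a zip on the path.

```python
import importlib
import sys
import tempfile
import zipfile
from pathlib import Path


def zip_package(package_dir, zip_path):
    """Zip a package directory so that its folder name is the top-level entry."""
    package_dir = Path(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(package_dir.rglob("*.py")):
            # Store paths relative to the parent directory, e.g. utils/utils1.py
            zf.write(path, path.relative_to(package_dir.parent).as_posix())
    return Path(zip_path)


# Build a throwaway utils/ package to stand in for the real one.
root = Path(tempfile.mkdtemp())
pkg = root / "utils"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "utils1.py").write_text("def foo():\n    return 'foo'\n")
(pkg / "utils2.py").write_text("def bar():\n    return 'bar'\n")

archive = zip_package(pkg, root / "utils.zip")
with zipfile.ZipFile(archive) as zf:
    names = sorted(zf.namelist())
print(names)

# Python can import straight from a zip file that is on sys.path.
sys.path.insert(0, str(archive))
importlib.invalidate_caches()
from utils.utils1 import foo
from utils.utils2 import bar

print(foo(), bar())
```

The resulting utils.zip would then be uploaded to S3 and referenced in --py-files.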
Note: You might also need to add this zip to the SparkContext with sc.addPyFile("utils.zip") before the following imports.
You can then import the modules like this:
from utils.utils1 import *
from utils.utils2 import *
Upvotes: 4