Vladimir Shadrin

Reputation: 376

How to include external python modules with pyspark

I'm new to Python and trying to launch my PySpark project on Spark on AWS EMR. The project is stored on AWS S3 and consists of several Python files, like this:

/folder1
 - main.py
/utils
 - utils1.py
 - utils2.py

I use the following command:

spark-submit --py-files s3://bucket/utils s3://bucket/folder1/main.py

But I get the error:

Traceback (most recent call last):
  File "/mnt/tmp/spark-1e38eb59-3ddd-4deb-8529-eace7465b6ce/main.py", line 15, in <module>
    from utils.utils1 import foo
ModuleNotFoundError: No module named 'utils'

What do I have to fix in my command? I know that I can pack my project into a zip file, but for now I need to do it without packing; still, I'd be grateful if you showed me both solutions.

UPD:

The EMR cluster's controller log says that the launch command looks like this:

hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --packages org.apache.spark:spark-avro_2.12:3.1.1 --driver-memory 100G --conf spark.driver.maxResultSize=100G --conf spark.hadoop.fs.s3.maxRetries=20 --conf spark.sql.sources.partitionOverwriteMode=dynamic --py-files s3://bucket/dir1/dir2/utils.zip --master yarn s3://bucket/dir1/dir2/dir3/main.py --args

But now I get the following error:

java.io.FileNotFoundException: File file:/mnt/var/lib/hadoop/steps/cluster-id/dir1/dir2/utils.zip does not exist

What's wrong?

Upvotes: 1

Views: 2532

Answers (1)

A.B

Reputation: 20445

Although this is not recommended (see below for the better option), if you do not want to zip the files you can, instead of providing the utils folder, pass the individual utils files to --py-files as a comma-separated list before the actual entry file:

'Args': ['spark-submit',
         '--py-files',
         '{your_s3_path_here}/utils/utils1.py,{your_s3_path_here}/utils/utils2.py',
         'main.py']
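For a plain spark-submit call like the one in the question, the equivalent command would be (using the bucket layout from the question):

spark-submit --py-files s3://bucket/utils/utils1.py,s3://bucket/utils/utils2.py s3://bucket/folder1/main.py

The underlying point is that --py-files expects concrete .py, .zip, or .egg files, not a directory, which is why passing s3://bucket/utils on its own does not work.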

The better option: zip the utils folder

You can zip utils and include it as follows.

First, make an empty __init__.py file at the root level inside utils (i.e. utils/__init__.py), so that the folder is treated as a Python package.

Then, from outside this directory, make a zip of it (for example, utils.zip).
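As a minimal sketch, you could build the archive with Python's standard library, run from the parent directory of utils (the zip -r utils.zip utils command line does the same):

import shutil

# Creates utils.zip with the utils/ folder as its top-level entry,
# so that `utils` is importable as a package once the zip is on
# the Python path.
shutil.make_archive('utils', 'zip', root_dir='.', base_dir='utils')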

For submission, you can add this zip as:

'Args': ['spark-submit',
         '--py-files',
         '{your_s3_path_here}/utils.zip',
         'main.py']

This assumes you have __init__.py, utils1.py, and utils2.py inside utils.zip.

Note: you might also need to add this zip to the SparkContext with sc.addPyFile("utils.zip") before the following imports.

You can now import them as:

from utils.utils1 import *
from utils.utils2 import *
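Putting it together, a minimal sketch of how main.py might look (the SparkSession setup and app name are illustrative assumptions, not from the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my-app").getOrCreate()
sc = spark.sparkContext

# Make the zipped package importable on the driver and executors;
# the path assumes utils.zip is available where the driver runs
# (for example, shipped via --py-files).
sc.addPyFile("utils.zip")

from utils.utils1 import foo  # the import from the original traceback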

Upvotes: 4
