Jay

Reputation: 326

pyspark submit command on AWS EMR

I have 4 Python scripts and one .txt configuration file. Of the 4 Python files, one has the entry point for the Spark application and also imports functions from the other Python files. But the configuration file is imported in another Python file that is not the entry point for the Spark application. I want to write the spark-submit command for PySpark, but I am not sure how to provide multiple files along with the configuration file in the spark-submit command when the configuration file is not a Python file but a text or ini file.

For demonstration, 4 Python files: file1.py, file2.py, file3.py, file4.py

1 configuration file: config.txt

file1.py: this file creates the Spark session and calls into all the other Python files. file3.py: this Python file reads config.txt.

I want to provide all these files with spark-submit, but I am not sure about the command. The command I have tried is below:

'Args': ['spark-submit',
         '--deploy-mode', 'cluster',
         '--master', 'yarn',
         '--executor-memory', conf['emr_step_executor_memory'],
         '--executor-cores', conf['emr_step_executor_cores'],
         '--conf', 'spark.yarn.submit.waitAppCompletion=true',
         '--conf', 'spark.rpc.message.maxSize=1024',
         f'{s3_path}/file1.py',
         '--py-files',
         f'{s3_path}/file2.py',
         f'{s3_path}/file3.py',
         f'{s3_path}/file4.py',
         '--files',
         f'{s3_path}/config.txt'
        ]

but the above command throws an error:

    File "file1.py", line 3, in <module>
        from file2 import *
    ModuleNotFoundError: No module named 'file2'

Upvotes: 0

Views: 3468

Answers (1)

A.B

Reputation: 20445

Option 1: put --py-files, with a comma-separated list of files, before the actual application file:

'Args': ['spark-submit',
         '--py-files',
         'file2.py,file3.py,file4.py',
         '--files',
         f'{s3_path}/config.txt',
         'file1.py']
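Applied to the question's command, the whole Args block might then look like the sketch below; s3_path, conf, and the option values are taken from the question, and the only change is that every option comes before the application file:

    'Args': ['spark-submit',
             '--deploy-mode', 'cluster',
             '--master', 'yarn',
             '--executor-memory', conf['emr_step_executor_memory'],
             '--executor-cores', conf['emr_step_executor_cores'],
             '--conf', 'spark.yarn.submit.waitAppCompletion=true',
             '--conf', 'spark.rpc.message.maxSize=1024',
             '--py-files',
             f'{s3_path}/file2.py,{s3_path}/file3.py,{s3_path}/file4.py',
             '--files',
             f'{s3_path}/config.txt',
             # the application file comes last; anything after it would be
             # treated as program arguments, not spark-submit options
             f'{s3_path}/file1.py']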

In your case the --py-files value would be f'{s3_path}/file2.py,{s3_path}/file3.py,{s3_path}/file4.py', as in the sketch above. Now, to include the text file:

   sc.textFile("config.txt") 
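If config.txt is really configuration (for example ini-style key/value pairs) rather than data you want as an RDD, a minimal sketch of reading it inside file3.py could look like this; it assumes the file was shipped with --files, and load_config is a hypothetical helper name:

    # minimal sketch for file3.py, assuming config.txt was distributed via --files
    import configparser
    from pyspark import SparkFiles

    def load_config(name="config.txt"):
        # SparkFiles.get resolves the local path of a file distributed by spark-submit
        parser = configparser.ConfigParser()
        parser.read(SparkFiles.get(name))
        return parser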

Option 2: Zipping the files

Alternatively, you can zip the files and include the zip as follows.

First, put them in a directory, for instance myfiles/ (in addition, create an empty __init__.py file at the root of this directory, i.e. myfiles/__init__.py).

From outside this directory, make a zip of it (for example myfiles.zip).
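For instance, the zip can be built with the standard library from the parent directory of myfiles/ (paths here are an assumption):

    # builds myfiles.zip containing the myfiles/ package, including its __init__.py
    import shutil

    shutil.make_archive("myfiles", "zip", root_dir=".", base_dir="myfiles")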

For submission, you can add this zip as

'Args': ['spark-submit',
         '--py-files',
         'myfiles.zip',
         'file1.py']

Now include this zip with the sc.addPyFile function:

sc.addPyFile("myfiles.zip")

Considering you have __init__.py, file2.py, file3.py, file4.py and config.txt in myfiles.zip,

You can now use them as

from myfiles.file2 import *
from myfiles.file3 import *
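Put together, the entry point file1.py might then look roughly like this; the app name, the zip path and the modules imported are assumptions for illustration:

    # rough sketch of file1.py under option 2
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("my-emr-job").getOrCreate()
    # register the zip before importing from it, so its modules are on the Python path
    spark.sparkContext.addPyFile("/home/hadoop/myfiles.zip")

    from myfiles.file2 import *   # imports intentionally placed after addPyFile
    from myfiles.file3 import *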

Update: you asked:

in option 2, do I need to provide the path for '--py-files', 'myfiles.zip' with spark-submit, or in sc.addPyFile(), or with both?

Yes, you need to provide the path of myfiles.zip, for example /home/hadoop/myfiles.zip, which means the file has to be on the master node. You can either copy it from S3 with a bootstrap script,

or add a step to copy it:

{
        'Name': 'setup - copy files4',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['aws', 's3', 'cp',
                YOUR_S3_URI + 'myfiles.zip',
                '/home/hadoop/']
        }
    }
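and then point --py-files of the spark-submit step at the copied zip; a sketch, reusing s3_path from the question and the /home/hadoop path from the copy step above (the step name is an assumption):

    {
        'Name': 'run spark job',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit',
                     '--deploy-mode', 'cluster',
                     '--master', 'yarn',
                     '--py-files', '/home/hadoop/myfiles.zip',
                     '--files', f'{s3_path}/config.txt',
                     f'{s3_path}/file1.py']
        }
    }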

Upvotes: 3
