Reputation: 326
I have 4 Python scripts and one .txt configuration file. Of the 4 Python files, one contains the entry point for the Spark application and imports functions from the other Python files. The configuration file, however, is read by one of the other Python files, not by the entry point. I want to write a spark-submit command, but I am not sure how to pass multiple files along with the configuration file when the configuration file is not a Python file but a text or ini file.
For demonstration, the 4 Python files are: file1.py, file2.py, file3.py, file4.py
1 configuration file: config.txt
file1.py: creates the Spark session and calls into all the other Python files. file3.py: reads config.txt.
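For concreteness, the layout looks roughly like this (a sketch; the helper names are placeholders, not my real code):

# file1.py -- entry point (sketch; helper names are placeholders)
from pyspark.sql import SparkSession
from file2 import *                # this import is what fails in the attempt below
from file3 import load_settings    # placeholder: file3 reads config.txt internally

spark = SparkSession.builder.appName("demo").getOrCreate()
settings = load_settings("config.txt")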
I want to provide all these files with spark-submit, but I am not sure about the command. The command I have tried is below:
'Args': ['spark-submit',
         '--deploy-mode', 'cluster',
         '--master', 'yarn',
         '--executor-memory', conf['emr_step_executor_memory'],
         '--executor-cores', conf['emr_step_executor_cores'],
         '--conf', 'spark.yarn.submit.waitAppCompletion=true',
         '--conf', 'spark.rpc.message.maxSize=1024',
         f'{s3_path}/file1.py',
         '--py-files',
         f'{s3_path}/file2.py',
         f'{s3_path}/file3.py',
         f'{s3_path}/file4.py',
         '--files',
         f'{s3_path}/config.txt'
        ]
but the above command throws an error:

  File "file1.py", line 3, in <module>
    from file2 import *
ModuleNotFoundError: No module named 'file2'
Upvotes: 0
Views: 3468
Reputation: 20445
Option 1: pass --py-files as a single comma-separated string, placed before the application file:
'Args': ['spark-submit',
         '--py-files',
         'file2.py,file3.py,file4.py',
         '--files',
         'config.txt',
         'file1.py'
        ]
Every spark-submit option must appear before the application file; anything that comes after file1.py is treated as an argument to your script, which is why your --py-files and --files were ignored. In your case the --py-files value would be f'{s3_path}/file2.py,{s3_path}/file3.py,{s3_path}/file4.py' (one string, comma-separated, no spaces).
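Putting it together with the flags from your question, a sketch of the corrected Args (reusing the same conf dict and s3_path you already have):

'Args': ['spark-submit',
         '--deploy-mode', 'cluster',
         '--master', 'yarn',
         '--executor-memory', conf['emr_step_executor_memory'],
         '--executor-cores', conf['emr_step_executor_cores'],
         '--conf', 'spark.yarn.submit.waitAppCompletion=true',
         '--conf', 'spark.rpc.message.maxSize=1024',
         '--py-files',
         f'{s3_path}/file2.py,{s3_path}/file3.py,{s3_path}/file4.py',
         '--files',
         f'{s3_path}/config.txt',
         f'{s3_path}/file1.py']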
Now, to read the text file that was shipped with --files, reference it by its bare name (Spark localizes it into the container's working directory):

sc.textFile("config.txt")
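Because --files localizes config.txt into the YARN container's working directory (in cluster deploy mode), you can also read it with plain Python instead of an RDD; a sketch, assuming the file is ini-style:

import configparser

# config.txt was shipped with --files, so a relative path resolves
# inside the container's working directory in cluster deploy mode
parser = configparser.ConfigParser()
parser.read("config.txt")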
Option 2: zipping the files
Alternatively, you can zip them and include the zip. First put them in a directory, for instance myfiles/ (and also create an empty __init__.py file at the root of this directory, i.e. myfiles/__init__.py). From outside this directory, make a zip of it (for example myfiles.zip).
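A sketch of building that archive with Python's standard library (assumes the directory layout just described):

import shutil

# Assumes ./myfiles/ contains __init__.py, file2.py, file3.py, file4.py, config.txt
shutil.make_archive("myfiles", "zip", root_dir=".", base_dir="myfiles")
# Produces ./myfiles.zip containing a top-level myfiles/ package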
For submission, you can add this zip as
'Args': ['spark-submit',
         '--py-files',
         'myfiles.zip',
         'file1.py'
        ]
You can also attach this zip at runtime with the sc.addPyFile function:

sc.addPyFile("myfiles.zip")
Assuming you have __init__.py, file2.py, file3.py, file4.py and config.txt in myfiles.zip, you can now import them as:

from myfiles.file2 import *
from myfiles.file3 import *

(Module names are case-sensitive and must match the file names, so myfiles.file2, not myfiles.File2; file1.py stays outside the zip because it is the application file itself.)
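One caveat: config.txt now lives inside the zip, so a plain open("config.txt") will not find it. A sketch of reading it through the package machinery instead (pkgutil works with zipped packages on sys.path):

import pkgutil

# Reads myfiles/config.txt from inside myfiles.zip; returns bytes
raw = pkgutil.get_data("myfiles", "config.txt")
text = raw.decode("utf-8")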
Update: you asked:

in option 2, do I need to provide the path for '--py-files', 'myfiles.zip' with spark-submit, in sc.addPyFile(), or in both?

Yes, you need to provide the path of myfiles.zip, e.g. /home/hadoop/myfiles.zip. This means the file has to be present on the master node; you can either copy it from S3 with a bootstrap script, or add a step to copy it:
{
    'Name': 'setup - copy files4',
    'ActionOnFailure': 'TERMINATE_CLUSTER',
    'HadoopJarStep': {
        'Jar': 'command-runner.jar',
        'Args': ['aws', 's3', 'cp',
                 YOUR_S3_URI + 'myfiles.zip',
                 '/home/hadoop/']
    }
}
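With the zip on the master node, the submit step can then reference the local path; a sketch, reusing the step shape from your question:

'Args': ['spark-submit',
         '--deploy-mode', 'cluster',
         '--master', 'yarn',
         '--py-files', '/home/hadoop/myfiles.zip',
         f'{s3_path}/file1.py']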
Upvotes: 3