Reputation: 369
Hadoop lets you run Java applications directly on the cluster using
hadoop jar <jar>
Now I have a Python script instead of a Java application.
Below is the skeleton of the .py file with most of the functionality removed; I only left the "remove-files-from-folder" part:
import os

def transform():
    inputfolder = "input"
    # Delete every file in the input folder.
    for filename in os.listdir(inputfolder):
        # os.path.join uses the correct separator for the platform.
        path = os.path.join(inputfolder, filename)
        os.remove(path)

def main():
    transform()

if __name__ == "__main__":
    main()
Is there a way to execute the .py file similarly to the way I'd execute a .jar file?
I'm new to Python and Hadoop. If my approach seems totally off and doesn't make sense, I'm happy for any kind of clarification!
Upvotes: 1
Views: 12815
Reputation: 164
If you are simply looking to distribute your Python script across the cluster, then you want to use Hadoop Streaming.
The basic syntax of the command looks like (from https://hadoop.apache.org/docs/r1.2.1/streaming.html):
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myPythonScript.py \
-file myPythonScript.py
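This creates a MapReduce job that runs your Python script as the mapper. With Hadoop Streaming, the mapper reads its input records line by line from stdin and writes its output to stdout.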
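For illustration, here is a minimal sketch of what a script passed as -mapper could look like. It is not the asker's file-removal logic, only an assumed word-count style example: the function name map_lines and the "word<TAB>1" output format are hypothetical choices made for this sketch. It also assumes the script is executable on the cluster nodes (hence the shebang line); otherwise the mapper would probably need to be given as "python myPythonScript.py".

```python
#!/usr/bin/env python
import sys


def map_lines(lines):
    """Yield one tab-separated "word<TAB>1" record per word in the input lines."""
    for line in lines:
        # str.split() with no arguments drops the trailing newline and extra spaces.
        for word in line.split():
            yield "%s\t%d" % (word, 1)


def main():
    # Hadoop Streaming feeds the input to the mapper on stdin
    # and collects whatever the mapper prints to stdout.
    for record in map_lines(sys.stdin):
        print(record)


if __name__ == "__main__":
    main()
```

You can try the script locally before submitting it, for example by piping a text file into it, since it only depends on stdin and stdout.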
Upvotes: 2