Rob

Reputation: 369

How to execute a python file (.py) on hadoop distributed file system (hdfs)

Hadoop offers the possibility to run Java applications directly on the cluster using

hadoop jar <jar>

Now I have a Python script instead of a Java application.

Below is the structure of the .py file with most of the functionality stripped out; only the "remove files from folder" part is left:

import os

def transform():
    # Delete every file in the "input" folder
    inputfolder = "input"
    for filename in os.listdir(inputfolder):
        path = os.path.join(inputfolder, filename)
        os.remove(path)

def main():
    transform()

if __name__ == "__main__":
    main()

Is there a way to execute the .py file similarly to the way I'd execute a .jar file?

I'm new to Python and Hadoop, so if my approach seems totally off and doesn't make sense, I'd appreciate any kind of clarification!

Upvotes: 1

Views: 12815

Answers (1)

mjrice04

Reputation: 164

If you are simply looking to distribute your Python script across the cluster, then you want to use Hadoop Streaming.

The basic syntax of the command looks like this (from https://hadoop.apache.org/docs/r1.2.1/streaming.html):

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myPythonScript.py \
-file myPythonScript.py

This essentially creates a MapReduce job for your Python script.
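With streaming, Hadoop pipes each input split to your script on stdin and collects whatever the script prints to stdout. As a rough sketch (a hypothetical word-count mapper, not your removal script), a streaming mapper could look like this:

#!/usr/bin/env python
# Hypothetical streaming mapper: reads lines from stdin and emits
# tab-separated "word<TAB>1" pairs on stdout for the reduce phase.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

You would then point the -mapper and -file options of the streaming command above at this script.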

Upvotes: 2
