Rob

Reputation: 369

How to execute a python file (.py) on hadoop distributed file system (hdfs)

Hadoop offers the possibility to run Java applications directly on the cluster using

hadoop jar <jar>

Now I have a Python script instead of a Java application.

Below is the structure of the .py file with most of the functionality stripped out; only the "remove files from folder" part is left:

import os

def transform():
    # Delete every file in the "input" folder
    inputfolder = "input"
    for filename in os.listdir(inputfolder):
        path = os.path.join(inputfolder, filename)
        os.remove(path)

def main():
    transform()

if __name__ == "__main__":
    main()

Is there a way to execute the .py file similarly to the way I'd execute a .jar file?

I'm new to Python and Hadoop, so if my approach seems totally off and doesn't make sense, I'd appreciate any kind of clarification!

Upvotes: 1

Views: 12815

Answers (1)

mjrice04

Reputation: 164

If you are simply looking to distribute your Python script across the cluster, then you want to use Hadoop Streaming.

The basic syntax of the command looks like this (from https://hadoop.apache.org/docs/r1.2.1/streaming.html):

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myPythonScript.py \
-file myPythonScript.py

This essentially creates a MapReduce job for your Python script.
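With streaming, Hadoop pipes each input split to your script on stdin and collects whatever the script prints to stdout. As a rough sketch (a hypothetical word-count mapper, not your removal script), a streaming mapper could look like this:

#!/usr/bin/env python
# Hypothetical streaming mapper: reads lines from stdin and emits
# tab-separated "word<TAB>1" pairs on stdout for the reduce phase.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

You would then point the -mapper and -file options of the streaming command above at this script.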

Upvotes: 2
