ouila

Reputation: 65

Submitting a pyspark job to Amazon EMR cluster from terminal

I have SSHed into the Amazon EMR server and want to submit a Spark job written in Python from the terminal (a simple word-count script and a sample.txt are both on the server). How do I do this, and what's the syntax?

The word_count.py is as follows:

from pyspark import SparkConf, SparkContext
from operator import add
import sys

## Constants
APP_NAME = "HelloWorld of Big Data"

## OTHER FUNCTIONS/CLASSES

def main(sc, filename):
    # Split each line into words, pair each word with 1, and sum the counts
    textRDD = sc.textFile(filename)
    words = textRDD.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))
    wordcount = words.reduceByKey(add).collect()
    for wc in wordcount:
        print(wc[0], wc[1])

if __name__ == "__main__":
    # Configure Spark
    conf = SparkConf().setAppName(APP_NAME)
    conf = conf.setMaster("local[*]")
    sc = SparkContext(conf=conf)
    # For the s3a:// scheme the credential keys are fs.s3a.access.key /
    # fs.s3a.secret.key (fs.s3.awsAccessKeyId belongs to the legacy s3:// filesystem)
    sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "XXXX")
    sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "YYYY")
    filename = "s3a://bucket_name/sample.txt"
    # filename = sys.argv[1]
    # Execute Main functionality
    main(sc, filename)

Upvotes: 2

Views: 2810

Answers (1)

Lucas Penna

Reputation: 61

You can run this command:

spark-submit s3://your_bucket/your_program.py
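
On EMR, spark-submit is preconfigured to run against YARN, so the short form above works as-is. A more explicit but equivalent invocation (bucket and script names are placeholders) would be:

spark-submit --master yarn --deploy-mode client s3://your_bucket/your_program.py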

If you need to run the script with Python 3, run this command before spark-submit:

export PYSPARK_PYTHON=python3.6
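
Putting the two steps together, a typical session on the master node looks something like this (paths are placeholders):

export PYSPARK_PYTHON=python3.6
spark-submit s3://your_bucket/your_program.py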

Remember to upload your program to an S3 bucket before running spark-submit.
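
For example, a hypothetical upload with the AWS CLI (the bucket name is a placeholder) could be:

aws s3 cp word_count.py s3://your_bucket/word_count.py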

Upvotes: 1
