Spark using Python : save RDD output into text files

Question

I am working on the word count problem in Spark using Python, but I run into a problem when I try to save the output RDD to a text file with the .saveAsTextFile command. Here is my code. Please help, I am stuck. Thanks for your time.

import re

from pyspark import SparkConf , SparkContext

def normalizewords(text):
    return re.compile(r'\W+',re.UNICODE).split(text.lower())

conf=SparkConf().setMaster("local[2]").setAppName("sorted result")
sc=SparkContext(conf=conf)

input=sc.textFile("file:///home/cloudera/PythonTask/sample.txt")

words=input.flatMap(normalizewords)

wordsCount=words.map(lambda x: (x,1)).reduceByKey(lambda x,y: x+y)

sortedwordsCount=wordsCount.map(lambda (x,y):(y,x)).sortByKey()

results=sortedwordsCount.collect()

for result in results:
    count=str(result[0])
    word=result[1].encode('ascii','ignore')

    if(word):
        print word +"		"+ count

results.saveAsTextFile("/var/www/myoutput")

WoodChopper · Accepted Answer

Since you called results=sortedwordsCount.collect(), results is no longer an RDD. It is an ordinary Python list (here, a list of tuples).

As you know, a list is a Python object/data structure, and append is its method for adding an element.

>>> x = []
>>> x.append(5)
>>> x
[5]

Similarly, an RDD is Spark's object/data structure, and saveAsTextFile is its method for writing the data to a file. The important point is that an RDD is a distributed data structure.

So we cannot use append on an RDD or saveAsTextFile on a list. collect is an RDD method that brings the RDD's contents into driver memory as a list.
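To make the failure concrete, here is a small sketch that reproduces it without Spark. The list below is a made-up stand-in for what collect() returns, and the path is only a placeholder; calling saveAsTextFile on it fails because plain lists have no such method.

```python
# Hypothetical stand-in for the list returned by sortedwordsCount.collect()
results = [(2, "spark"), (5, "python")]

try:
    # A Python list has no saveAsTextFile method, so this raises AttributeError
    results.saveAsTextFile("/tmp/placeholder-output")
except AttributeError as err:
    message = str(err)
    print(message)
```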

As mentioned in the comments, either save sortedwordsCount with saveAsTextFile, or open a file in Python and write results to it.
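A minimal sketch of the second option, writing the collected list with ordinary Python file I/O. The helper name write_counts, the sample data, and the temporary output path are assumptions for illustration; in the real script, results would come from sortedwordsCount.collect().

```python
import os
import tempfile


def write_counts(results, path):
    """Write (count, word) pairs from a collected list to a local text file."""
    with open(path, "w") as out:
        for count, word in results:
            if word:  # skip empty tokens, as the question's print loop does
                out.write("%s\t%d\n" % (word, count))


# Hypothetical stand-in for results = sortedwordsCount.collect()
results = [(1, "spark"), (2, "python"), (3, "")]

# Placeholder location; substitute the path you actually want to write to
output_path = os.path.join(tempfile.mkdtemp(), "myoutput.txt")
write_counts(results, output_path)

# The first option skips collect() and calls the method on the RDD instead:
#     sortedwordsCount.saveAsTextFile("/var/www/myoutput")
```

Note that saveAsTextFile writes a directory of part files rather than a single file, so the Python approach is the simpler one when the result is small enough to fit in driver memory.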
