Reputation: 103
I am trying to solve the word count problem in Spark using Python, but I run into a problem when I try to save the output RDD to a text file with the .saveAsTextFile command. Here is my code. Please help, I am stuck. I appreciate your time.
import re
from pyspark import SparkConf, SparkContext

def normalizewords(text):
    return re.compile(r'\W+', re.UNICODE).split(text.lower())

conf = SparkConf().setMaster("local[2]").setAppName("sorted result")
sc = SparkContext(conf=conf)
input = sc.textFile("file:///home/cloudera/PythonTask/sample.txt")
words = input.flatMap(normalizewords)
wordsCount = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
sortedwordsCount = wordsCount.map(lambda (x, y): (y, x)).sortByKey()
results = sortedwordsCount.collect()
for result in results:
    count = str(result[0])
    word = result[1].encode('ascii', 'ignore')
    if (word):
        print word + "\t\t" + count
results.saveAsTextFile("/var/www/myoutput")
Upvotes: 6
Views: 27110
Reputation: 11
Change results=sortedwordsCount.collect() to results=sortedwordsCount, because after .collect() results is a list rather than an RDD.
Upvotes: 1
Reputation: 4375
Since you collected with results=sortedwordsCount.collect(), results is not an RDD. It is an ordinary Python list (here, a list of tuples).
As you know, a list is a Python object/data structure, and append is its method for adding an element.
>>> x = []
>>> x.append(5)
>>> x
[5]
Similarly, an RDD is Spark's object/data structure, and saveAsTextFile is its method for writing to a file. The important thing is that it is a distributed data structure.
So we cannot use append on an RDD, or saveAsTextFile on a list. collect is the RDD method that brings the RDD's contents into driver memory.
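The failure in the question can be reproduced without Spark. This is a minimal sketch, assuming results holds the kind of list that collect() returns. The sample tuples are made up for illustration.

```python
# Stand-in for sortedwordsCount.collect(): a plain Python list of
# (count, word) tuples. The values here are invented sample data.
results = [(1, "spark"), (2, "python")]

try:
    # A list has no such method, so this raises AttributeError.
    results.saveAsTextFile("/var/www/myoutput")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

The message names the list type and the missing attribute, which is the clue that results is no longer an RDD at that point.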
As mentioned in the comments, either save sortedwordsCount with saveAsTextFile, or open a file in Python and write results to it.
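The second option, writing the collected list from Python, might look roughly like this. It is a sketch under assumptions: the sample data, the helper name write_counts, and the temporary output path are all hypothetical, and it is written for Python 3 while the question's code is Python 2.

```python
import os
import tempfile

# Stand-in for sortedwordsCount.collect(): a list of (count, word) tuples.
# These values are made up; the empty word mimics what the regex split
# in the question can produce.
results = [(1, "spark"), (2, "python"), (3, ""), (5, "the")]


def write_counts(pairs, path):
    """Write one 'word<TAB>count' line per pair, skipping empty words."""
    with open(path, "w") as handle:
        for count, word in pairs:
            if word:
                handle.write("{0}\t{1}\n".format(word, count))


# Hypothetical output location; the question used /var/www/myoutput.
output_path = os.path.join(tempfile.mkdtemp(), "myoutput.txt")
write_counts(results, output_path)
```

For the first option, saveAsTextFile is called on the RDD itself, for example sortedwordsCount.saveAsTextFile("/var/www/myoutput"), before or instead of collect(). Spark then writes a directory of part files at that path rather than a single file.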
Upvotes: 8