PainIsAMaster

Reputation: 2076

Spark/Python, reduceByKey() then find top 10 most frequent words and frequencies

I have a virtual machine set up with Hadoop + Spark, and I'm reading a text file "words.txt" from HDFS, then calling flatMap(), map(), and reduceByKey() to try to get the top 10 most frequent words and their occurrences. The majority of the code is complete and the tuples are being aggregated; I just need a way to find the top 10. I know I need to iterate through the values in the (K, V) tuples, where the key is the word (a str) from words.txt and the value is the integer count of how many times that word appeared in the file, and keep a running top 10. The screenshot below was taken after reduceByKey() had already been called; you can see 'the' appears 40 times (at the end of the screenshot, to the right).

Here's the output: [screenshot showing the (word, count) tuples after reduceByKey]

Here's my code so far:

from pyspark import SparkConf, SparkContext

# Spark set-up
conf = SparkConf()
conf.setAppName("Word count App")
sc = SparkContext(conf=conf)

# read from text file words.txt on HDFS
rdd = sc.textFile("/user/spark/words.txt")

# flatMap() to output multiple elements for each input value, split on space and make each word lowercase
rdd = rdd.flatMap(lambda x: x.lower().split(' '))

# Map each word to a (word, 1) tuple
rdd = rdd.map(lambda x: (x, 1))

# Perform aggregation: sum the int values for each unique key
rdd = rdd.reduceByKey(lambda x, y: x + y)

# This is where I need a function or lambda to sort in descending order
# so I can grab the top 10 outputs, then print them out below with a for loop

# for item in out:
#     print(item[0], '\t:\t', str(item[1]))

I know I would normally just make a variable called "max" and update it only when a larger value is found in the list or tuple, but what's confusing me is that I'm dealing with Spark and RDDs, and I've been getting errors because I'm not quite sure what the RDD transformations like map(), flatMap(), and reduceByKey() actually return.
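For reference, collect() on the reduced RDD returns a plain Python list of (word, count) tuples, so the top 10 can be found with an ordinary driver-side sort. A minimal sketch, assuming the vocabulary is small enough to fit in driver memory:

# collect() brings the reduced (word, count) pairs back to the driver
# as a regular Python list, so built-in sorted() can rank them
pairs = rdd.collect()
top10 = sorted(pairs, key=lambda kv: kv[1], reverse=True)[:10]
for word, count in top10:
    print(word, '\t:\t', str(count))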

Any help much appreciated

Upvotes: 3

Views: 8557

Answers (1)

blackbishop

Reputation: 32660

You can invert the K and V after the reduce so that you can use the sortByKey function:

rdd.map(lambda (k,v): (v,k)).sortByKey(False).take(10)

For Python 3: (as tuple unpacking in lambda expressions is no longer supported)

rdd.map(lambda x: (x[1], x[0])).sortByKey(False).take(10)
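
Usage note: sortByKey(False) sorts by count in descending order, and take(10) returns the 10 largest pairs (as (count, word) tuples) to the driver. Alternatively, takeOrdered can do the sort and the take in one step without inverting the pairs; a minimal sketch:

# takeOrdered(10, key=...) returns the 10 smallest elements by key,
# so negating the count yields the 10 most frequent (word, count) pairs
rdd.takeOrdered(10, key=lambda x: -x[1])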

 

Upvotes: 3
