Reshma Yadav

Reputation: 53

Frequency of Words in a file using Python

I'm trying to count the frequency of each word in a file using a Python program.

from pyspark import SparkContext

sc = SparkContext(appName="Words")
lines = sc.textFile(sys.argv[1], 1)
counts=dict()
words = lines.split(" ")
for word in words:
    if word in counts:
        counts[word] += 1
    else:
        counts[word] = 1

output = counts.collect()
for (word, count) in output:
    print "%s: %i" % (word, count)

sc.stop()

This is not giving me the desired output. How can this code be improved?

Upvotes: 0

Views: 2663

Answers (1)

pault

Reputation: 43504

It seems like you are mixing up plain Python and Spark.

When you read the file using pyspark.SparkContext.textFile(), you get an RDD of strings, one element per line. Quoting myself from an answer to a different question:

All of the operations you want to do are on the contents of the RDD, the elements of the file. Calling split() on the RDD doesn't make sense, because split() is a string function. What you want to do instead is call split() and the other operations on each record (line in the file) of the RDD. This is exactly what map() does.
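
To make the distinction concrete, here is a minimal sketch (assuming lines is the RDD returned by sc.textFile() above; the name words_per_line is only for illustration):

# Calling split() directly on the RDD fails, because an RDD has no split() method:
#   lines.split(" ")   # AttributeError: 'RDD' object has no attribute 'split'
# Apply split() to each element of the RDD (each line of the file) with map() instead:
words_per_line = lines.map(lambda line: line.split(" "))  # RDD of lists of words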

Here is how you can modify your code to count the word frequencies using PySpark.

First we will map each word w in each line into a tuple of the form (w, 1). Then we will call reduceByKey() and add the counts for each word.

For example, if the line was "The quick brown fox jumps over the lazy dog", the map step would turn this line into:

[('The', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumps', 1), ('over', 1),
 ('the', 1), ('lazy', 1), ('dog', 1)]

Since the map step returns a list of tuples for each line, we call flatMap() so that each tuple becomes its own record in the resulting RDD. One other thing to think about here is whether you want the counts to be case sensitive, and whether there are any punctuation marks or special characters to remove.
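
For example, a hypothetical tokenize() helper (not part of the original answer) could handle both the lowercasing and some basic punctuation stripping before the words are turned into tuples:

from string import punctuation

def tokenize(line):
    # hypothetical helper: lowercase the line, split on whitespace,
    # strip leading/trailing punctuation, and drop tokens that end up empty
    words = [w.strip(punctuation) for w in line.lower().split()]
    return [w for w in words if w]

# it could then replace the plain split in the flatMap step:
# counts = lines.flatMap(lambda x: [(w, 1) for w in tokenize(x)]).reduceByKey(add)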

After the flatMap(), we can call reduceByKey(), which gathers all tuples with the same key (in this case the word) and applies the reduce function to their values (in this case operator.add).
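
As a small illustration of what reduceByKey() does on its own (a sketch, not part of the original answer, using the same sc and add as in the full code below; pairs is just an example RDD):

pairs = sc.parallelize([("the", 1), ("dog", 1), ("the", 1)])
pairs.reduceByKey(add).collect()  # [('the', 2), ('dog', 1)] (order may vary)

Putting it all together: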

import sys
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="Words")
lines = sc.textFile(sys.argv[1], 1)  # this is an RDD

# counts is an RDD of (word, count) pairs
counts = lines.flatMap(lambda x: [(w.lower(), 1) for w in x.split()]).reduceByKey(add)

# collect brings it to a list in local memory
output = counts.collect()
for (word, count) in output:
    print "%s: %i" % (word, count)

sc.stop()  # stop the spark context
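
Note that collect() brings the entire result back to the driver, which is fine for a small file. For a large input, one alternative (not shown in the original answer) would be to write the counts out with saveAsTextFile() instead of collecting and printing them:

counts.saveAsTextFile(sys.argv[2])  # hypothetical second argument: an output directory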

Upvotes: 2
