Reputation: 742
I have an array of strings (called 'words') in a DataFrame.
If I type 'words' in the PySpark console, I get:
DataFrame[words: array<string>]
Each element is comma separated. Now, given this array, I want to count the word frequencies in this way:
count = words.flatMap(lambda line: line.split(',')).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
Now, when I try to print my result (using .first(), .collect(), or .take(n)), I get this error:
PythonRDD[292] at RDD at PythonRDD.scala:43
Is it possible to count word frequencies using the split function with a comma? Or are there other ways?
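For reference, a minimal sketch of how a DataFrame with this schema could be built (the sample rows here are invented for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample: a single array<string> column named 'words'
words = spark.createDataFrame(
    [(['an', 'animate'],), (['battlefield'],), (['be', 'gentle'],)],
    ['words'])

words
# DataFrame[words: array<string>]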
Upvotes: 1
Views: 3639
Reputation: 11
Try this in PySpark:
from pyspark.sql.functions import col, explode
words.select(explode(col('words')).alias('word'))\
    .groupBy('word').count().orderBy('count', ascending=False).show(100, truncate=False)
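If you then want the counts back on the driver as a plain Python dict, one option (a sketch, not part of the original answer) is:
# Collect the aggregated rows and turn them into a word -> count dict
rows = words.select(explode(col('words')).alias('word'))\
    .groupBy('word').count().collect()
freqs = {row['word']: row['count'] for row in rows}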
Upvotes: 1
Reputation: 214927
The words column is already an array, so you can't split it; perhaps you mean this?
words.show()
#+------------------+
#| words|
#+------------------+
#|[, ", an, animate]|
#|[, ", battlefield]|
#| [, ", be, gentle]|
#+------------------+
words.rdd.flatMap(lambda a: [(w, 1) for w in a.words]).reduceByKey(lambda a, b: a + b).collect()
# [(u'', 3), (u'gentle', 1), (u'battlefield', 1), (u'be', 1), (u'animate', 1), (u'"', 3), (u'an', 1)]
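If you want the pairs ordered by frequency before collecting, RDD.sortBy can do it (a small addition, not in the original answer):
# Sort (word, count) pairs by descending count, then collect
words.rdd.flatMap(lambda a: [(w, 1) for w in a.words])\
    .reduceByKey(lambda a, b: a + b)\
    .sortBy(lambda pair: pair[1], ascending=False)\
    .collect()
# e.g. [(u'', 3), (u'"', 3), ...] (order of equal counts may vary)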
Or, a simpler method: use countByValue after flattening the words column, if the result can fit in driver memory:
words.rdd.flatMap(lambda a: a.words).countByValue()
# defaultdict(<type 'int'>, {u'': 3, u'battlefield': 1, u'"': 3, u'be': 1, u'an': 1, u'gentle': 1, u'animate': 1})
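Note that countByValue returns the result as a local Python defaultdict on the driver, so from there it is ordinary Python; for example, to list words by descending frequency (illustrative):
counts = words.rdd.flatMap(lambda a: a.words).countByValue()
# counts is a plain local dict, so standard Python sorting applies
sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
# e.g. [(u'', 3), (u'"', 3), (u'be', 1), ...] (ties in arbitrary order)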
Upvotes: 5