Reputation: 365
I'm trying to use pyspark to count the number of occurrences.
Suppose I have data like this:
data = sc.parallelize([(1, [u'a', u'b', u'd']),
                       (2, [u'a', u'c', u'd']),
                       (3, [u'a'])])
count = sc.parallelize([(u'a', 0), (u'b', 0), (u'c', 0), (u'd', 0)])
Is it possible to count the number of occurrences in data and update them in count?
The result should look like [(u'a',3),(u'b',1),(u'c',1),(u'd',2)].
Upvotes: 6
Views: 8420
Reputation:
I would use Counter:
>>> from collections import Counter
>>>
>>> data.values().map(Counter).reduce(lambda x, y: x + y)
Counter({'a': 3, 'b': 1, 'c': 1, 'd': 2})
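If you want the result in the same list-of-pairs shape as the expected output, one way (a small step done on the driver after the reduce) is to sort the Counter's items:
>>> totals = data.values().map(Counter).reduce(lambda x, y: x + y)
>>> sorted(totals.items())   # Counter -> sorted list of (item, count) pairs
[('a', 3), ('b', 1), ('c', 1), ('d', 2)]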
Upvotes: 6
Reputation: 51
RDDs are immutable and thus cannot be updated. Instead, compute the counts from your data:
count = (data
         .flatMap(lambda kv: kv[1])          # flatten the lists of items
         .map(lambda w: (w, 1))              # emit (item, 1) pairs
         .reduceByKey(lambda a, b: a + b))   # sum the 1s per item
Then, if the result fits in the driver's memory, feel free to call .collect() on count.
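If you also want the keys from the original zero-initialised RDD to show up even when they never occur in the data, one possible sketch is a left outer join (the names initial, occurrences and updated are introduced here for illustration, they are not from the question):
initial = sc.parallelize([(u'a', 0), (u'b', 0), (u'c', 0), (u'd', 0)])
occurrences = (data
               .flatMap(lambda kv: kv[1])
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))

# leftOuterJoin keeps every key from `initial`; keys absent from
# `occurrences` come back as None, which is treated as 0 here.
updated = (initial
           .leftOuterJoin(occurrences)
           .mapValues(lambda v: v[0] + (v[1] or 0)))

updated.collect()   # [('a', 3), ('b', 1), ('c', 1), ('d', 2)] (order may vary)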
Upvotes: 3
Reputation: 37928
You wouldn't update count, since RDDs are immutable. Just run the calculation you want and assign the result to whatever variable you like:
In [17]: data.flatMap(lambda x: x[1]).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).collect()
Out[17]: [('b', 1), ('c', 1), ('d', 2), ('a', 3)]
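As a side note, if the counts fit on the driver anyway, countByValue() is a shorter way to get the same numbers as a dictionary (the exact repr and key order may differ from what is shown here):
In [18]: data.flatMap(lambda x: x[1]).countByValue()
Out[18]: defaultdict(<class 'int'>, {'a': 3, 'b': 1, 'c': 1, 'd': 2})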
Upvotes: 1