Reputation: 809
I have a list of lists containing strings and I want to count in how many of those lists each element appears:
list_of_lists = [["dog", "cow"], ["dragon", "ox", "cow"], ["fox", "cow", "dog"]]
So, cow appears in 3 lists, dog appears in 2, etc.
For such a small dataset, I would normally do:
from collections import Counter
from itertools import chain
count = Counter(chain.from_iterable(set(x) for x in list_of_lists))
and thus:
print(count["dog"])
# 2
However, I want to do this for a large dataset using PySpark and MapReduce, so that for each element in the list of lists I get its count as above:
[("dog", 2),
("cow", 3),
("dragon", 1),
("ox", 1),
("fox", 1)]
etc.
I am trying things like:
list_of_lists = sc.parallelize(list_of_lists)
list_occurencies = list_of_lists.map(lambda x: x, count[x])
but with no effect.
Upvotes: 0
Views: 171
Reputation: 32660
Use flatMap to flatten the nested lists, then reduceByKey to get the count for each word. Wrapping each inner list in set ensures an element is counted at most once per list:
list_of_lists = sc.parallelize(list_of_lists)

# deduplicate within each inner list, emit (word, 1) pairs, then sum per word
list_of_lists = list_of_lists.flatMap(lambda x: set(x)) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda a, b: a + b)

print(list_of_lists.collect())
# [('fox', 1), ('dragon', 1), ('ox', 1), ('dog', 2), ('cow', 3)]
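If you also want the same dictionary-style lookup as the local Counter (e.g. count["dog"]), one option is to collect the pair RDD as a map on the driver. A minimal sketch, assuming the result RDD above (still named list_of_lists here) has a vocabulary small enough to fit in driver memory:
# Collect the (word, count) pairs into a plain Python dict on the driver.
# Only do this if the number of distinct words is small enough to hold locally.
count = list_of_lists.collectAsMap()
print(count["dog"])
# 2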
Upvotes: 2