PySpark counting the occurrence of values with keys

Question

I have a list of (key,value) pairs of the form:

x=[(('cat','dog'),('a','b')),(('cat','dog'),('a','b')),(('mouse','rat'),('e','f'))]

I want to count the number of times each value tuple appears with the key tuple.

Desired output:

[(('cat','dog'),('a','b',2)),(('mouse','rat'),('e','f',1))]

A working solution is:

from collections import Counter
xs=sc.parallelize(x)
xs=xs.groupByKey()
xs=xs.map(lambda kv: (kv[0], Counter(kv[1])))

However, for large datasets this method fills up the disk space (~600GB). I tried to implement a similar solution using reduceByKey:

xs=xs.reduceByKey(Counter).collect()

but I get the following error:

TypeError: __init__() takes at most 2 arguments (3 given)

Katya Willard · Accepted Answer

Here is how I usually do it:

xs=sc.parallelize(x)
a = xs.map(lambda x: (x, 1)).reduceByKey(lambda a,b: a+b)

a.collect() yields:

[((('mouse', 'rat'), ('e', 'f')), 1), ((('cat', 'dog'), ('a', 'b')), 2)]
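For readers without a Spark context at hand, the same counting logic can be sketched in plain Python. The helper reduce_by_key below is hypothetical (it is not part of PySpark) and only imitates, on a single machine, what reduceByKey does: it folds all values that share a key using a two-argument function.

```python
from operator import add

def reduce_by_key(pairs, func):
    """Hypothetical local stand-in for RDD.reduceByKey: fold values per key with func."""
    merged = {}
    for key, value in pairs:
        # First occurrence stores the value; later ones are combined with func.
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

x = [(('cat', 'dog'), ('a', 'b')),
     (('cat', 'dog'), ('a', 'b')),
     (('mouse', 'rat'), ('e', 'f'))]

# Mirrors xs.map(lambda x: (x, 1)): the whole (key, value) pair becomes the new key.
paired = [(pair, 1) for pair in x]

# Mirrors .reduceByKey(lambda a, b: a + b): sum the 1s for each distinct pair.
counts = reduce_by_key(paired, add)
```

Because the reducing function is always called with two values, this also suggests why reduceByKey(Counter) raises a TypeError: Counter accepts only a single positional argument.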

I'm going to assume that you want the counts (here, 1 and 2) inside the second key in the (key1, key2) pair.

To achieve that, try this:

a.map(lambda x: (x[0][0], x[0][1] + (x[1],))).collect()

The last step remaps each record: it keeps the first key pair (like ('mouse','rat')), takes the second key pair (like ('e','f')), and appends the count, x[1], wrapped in a one-element tuple.
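The remapping can also be checked without Spark by applying it to the collected list. This is only a sketch: attach_count is a hypothetical name for a named version of the lambda, and the input is copied from the a.collect() output shown earlier.

```python
# Output of a.collect() as shown in the answer.
collected = [((('mouse', 'rat'), ('e', 'f')), 1),
             ((('cat', 'dog'), ('a', 'b')), 2)]

def attach_count(record):
    """Move the count into the value tuple: ((key, value), n) -> (key, value + (n,))."""
    (key, value), count = record
    return key, value + (count,)

result = [attach_count(record) for record in collected]
```

A named function like this could be passed to a.map in place of the lambda, which may be easier to read than the nested indexing.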
