Reputation: 1059
I have a list of (key, value) pairs of the form:
x=[(('cat','dog'),('a','b')),(('cat','dog'),('a','b')),(('mouse','rat'),('e','f'))]
I want to count the number of times each value tuple appears with the key tuple.
Desired output:
[(('cat','dog'),('a','b',2)),(('mouse','rat'),('e','f',1))]
A working solution is:
from collections import Counter
xs=sc.parallelize(x)
xs=xs.groupByKey()
xs=xs.map(lambda kv:(kv[0],Counter(kv[1])))
However, for large datasets this method fills up the disk space (~600 GB). I tried to implement a similar solution using reduceByKey:
xs=xs.reduceByKey(Counter).collect()
but I get the following error:
TypeError: __init__() takes at most 2 arguments (3 given)
Upvotes: 1
Views: 6624
Reputation: 2182
Here is how I usually do it:
xs=sc.parallelize(x)
a = xs.map(lambda x: (x, 1)).reduceByKey(lambda a,b: a+b)
a.collect()
yields:
[((('mouse', 'rat'), ('e', 'f')), 1), ((('cat', 'dog'), ('a', 'b')), 2)]
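If you want to check the counting logic without a Spark context, the same map-then-reduce idea can be sketched in plain Python. The helper `reduce_by_key` below is a made-up stand-in for illustration, not a Spark API; it only mimics combining values that share a key, two at a time.

```python
def reduce_by_key(pairs, func):
    """Hypothetical local stand-in: fold values that share a key with func."""
    out = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later ones are combined pairwise.
        out[key] = func(out[key], value) if key in out else value
    return out

x = [(('cat', 'dog'), ('a', 'b')),
     (('cat', 'dog'), ('a', 'b')),
     (('mouse', 'rat'), ('e', 'f'))]

# Same shape as xs.map(lambda x: (x, 1)): the whole (key, value) pair becomes the key.
ones = [(pair, 1) for pair in x]

# Same reducer as in reduceByKey(lambda a, b: a + b).
counts = reduce_by_key(ones, lambda a, b: a + b)
```

Because only running sums are kept per key, nothing like the full list of values per key ever has to be materialized, which is presumably why this is gentler on disk than groupByKey.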
I'm going to assume that you want the counts (here, 1 and 2) inside the second key in the (key1, key2) pair.
To achieve that, try this:
a.map(lambda x: (x[0][0], x[0][1] + (x[1],))).collect()
The last step basically remaps each record: it takes the first key pair (like ('mouse','rat')), then the second key pair (like ('e','f')), and appends the count, x[1], wrapped in a one-element tuple, to the second key pair.
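As for the TypeError in the question: as far as I can tell, reduceByKey calls the function you pass with two values at a time, so reduceByKey(Counter) ends up calling Counter(v1, v2), and Counter accepts at most one positional argument. If you do want a Counter per key, one option is to wrap each value in its own Counter first and merge them with +. Here is a rough plain-Python sketch of that idea (no Spark needed); in PySpark I would expect the equivalent to be something like xs.mapValues(lambda v: Counter([v])).reduceByKey(lambda a, b: a + b), but I have not run that against your data.

```python
from collections import Counter

x = [(('cat', 'dog'), ('a', 'b')),
     (('cat', 'dog'), ('a', 'b')),
     (('mouse', 'rat'), ('e', 'f'))]

# Passing Counter as the reducer effectively calls Counter(v1, v2), which fails.
try:
    Counter(('a', 'b'), ('a', 'b'))
    counter_error = None
except TypeError as exc:
    counter_error = exc

# Wrap each value tuple in a single-element Counter (the mapValues step).
wrapped = [(key, Counter([value])) for key, value in x]

# Merge Counters that share a key, two at a time (the reduceByKey step).
merged = {}
for key, counter in wrapped:
    merged[key] = merged[key] + counter if key in merged else counter
```

This keeps one small Counter per key rather than every value for that key, at the cost of creating a Counter object per record.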
Upvotes: 6