user14913431

Reputation: 45

PySpark sort values

I have the following data:

[(u'ab', u'cd'),
 (u'ef', u'gh'),
 (u'cd', u'ab'),
 (u'ab', u'gh'),
 (u'ab', u'cd')]

I would like to do a MapReduce on this data to find out how often the same pairs appear.

As a result I get:

[((u'ab', u'cd'), 2),
 ((u'cd', u'ab'), 1),
 ((u'ab', u'gh'), 1),
 ((u'ef', u'gh'), 1)]

As you can see, this is not quite right: (u'ab', u'cd') should be 3 instead of 2, because (u'cd', u'ab') is the same pair.

My question is: how can I make the program count (u'cd', u'ab') and (u'ab', u'cd') as the same pair? I was thinking about sorting the values in each row, but could not find a solution for this.
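For reference, the effect of normalizing each pair can be illustrated in plain Python, outside Spark, using collections.Counter (this only sketches the counting logic, not a Spark job):

```python
from collections import Counter

data = [(u'ab', u'cd'),
        (u'ef', u'gh'),
        (u'cd', u'ab'),
        (u'ab', u'gh'),
        (u'ab', u'cd')]

# Counting the raw tuples treats mirrored pairs as distinct keys.
raw_counts = Counter(data)
print(raw_counts[(u'ab', u'cd')])  # 2 -- (u'cd', u'ab') is counted separately

# Sorting each pair first gives both orderings the same key.
normalized_counts = Counter(tuple(sorted(pair)) for pair in data)
print(normalized_counts[(u'ab', u'cd')])  # 3
```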

Upvotes: 0

Views: 145

Answers (2)

blackbishop

Reputation: 32710

You can sort the values, then use reduceByKey to count the pairs:

rdd1 = rdd.map(lambda x: (tuple(sorted(x)), 1))\
    .reduceByKey(lambda a, b: a + b)

rdd1.collect()
# [(('ab', 'gh'), 1), (('ef', 'gh'), 1), (('ab', 'cd'), 3)]

Upvotes: 1

mck

Reputation: 42422

You can key by the sorted element, and count by key:

result = rdd.keyBy(lambda x: tuple(sorted(x))).countByKey()

print(result)
# defaultdict(<class 'int'>, {('ab', 'cd'): 3, ('ef', 'gh'): 1, ('ab', 'gh'): 1})

To convert the result into a list, you can do:

result2 = sorted(result.items())

print(result2)
# [(('ab', 'cd'), 3), (('ab', 'gh'), 1), (('ef', 'gh'), 1)]

Upvotes: 0
