Johny19

Reputation: 5582

Spark (pyspark): how to reduceByKey on only 2 elements of a 3-element tuple

I have the result of a map that looks like this:

[ ('success', '', 1), ('success', '', 1), ('error', 'something_random', 1), ('error','something_random', 1), ('error', 'something_random', 1) ]

Is there a way, with a reduce by key, to end up with:

[ ('success', 2), ('error', 3) ]

and then somehow print all the errors to a file?

Upvotes: 0

Views: 1630

Answers (1)

akuiper

Reputation: 214957

Here are two options to get the result you need:

1) convert the 3-element tuples to 2-element tuples, then use reduceByKey:

rdd.map(lambda x: (x[0], x[2])).reduceByKey(lambda x, y: x + y).collect()
# [('success', 2), ('error', 3)]
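If you want to check the aggregation logic without a Spark cluster, the same map-then-reduce-by-key steps can be sketched in plain Python (a local illustration only, not part of the original answer; the input list is taken from the question):

```python
from collections import defaultdict

records = [
    ('success', '', 1), ('success', '', 1),
    ('error', 'something_random', 1),
    ('error', 'something_random', 1),
    ('error', 'something_random', 1),
]

# map step: keep only (status, count), dropping the middle element
pairs = [(status, count) for status, _, count in records]

# reduceByKey step: sum the counts for each key
totals = defaultdict(int)
for key, count in pairs:
    totals[key] += count

print(list(totals.items()))  # [('success', 2), ('error', 3)]
```

The `defaultdict` accumulation plays the role of `reduceByKey(lambda x, y: x + y)`: values sharing a key are combined with `+`.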

2) groupBy the first element of the tuple, then sum the values (third element) of each group using mapValues:

rdd.groupBy(lambda x: x[0]).mapValues(lambda g: sum(x for _, _, x in g)).collect()
# [('success', 2), ('error', 3)]
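For the second part of the question (writing the errors to a file), in Spark you could keep only the error records with `rdd.filter(...)` and write them out with `saveAsTextFile`. A minimal local sketch of the same filtering logic, assuming the input list from the question and a hypothetical output path `errors.txt`:

```python
records = [
    ('success', '', 1), ('success', '', 1),
    ('error', 'something_random', 1),
    ('error', 'something_random', 1),
    ('error', 'something_random', 1),
]

# filter step: keep the message of each error record
# (in Spark: rdd.filter(lambda x: x[0] == 'error'))
errors = [msg for status, msg, _ in records if status == 'error']

# write one error message per line
# (in Spark: errors_rdd.saveAsTextFile('errors.txt') instead)
with open('errors.txt', 'w') as f:
    for msg in errors:
        f.write(msg + '\n')
```

Note that `saveAsTextFile` writes a directory of part files rather than a single file, so for small results some people `collect()` first and write locally as above.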

Upvotes: 6
