catq

Reputation: 127

Spark: sum up values regardless of keys

My list of tuples looks like this:

Tup = [(u'X45', 2), (u'W80', 1), (u'F03', 2), (u'X61', 2)]

I want to sum all values up, in this case, 2+1+2+2=7

I can use Tup.reduceByKey() in Spark if the keys are the same, but which function can I use to sum all the values regardless of the key?

I've tried Tup.sum(), but it gives me (u'X45', 2, u'W80', 1, u'F03', 2, u'X61', 2)

BTW, due to the large dataset, I want to do the sum within the RDD, so I don't want to use Tup.collect() and sum the values outside of Spark.

Upvotes: 6

Views: 5600

Answers (1)

Knows Not Much

Reputation: 31586

This is pretty easy.

Conceptually, you should first map over your original RDD to extract the second element of each tuple, and then sum those values.

In Scala

val x = List(("X45", 2), ("W80", 1), ("F03", 2), ("X61", 2))
val rdd = sc.parallelize(x)
rdd.map(_._2).sum()  // keep only the second element of each tuple, then sum

In Python

x = [(u'X45', 2), (u'W80', 1), (u'F03', 2), (u'X61', 2)]
rdd = sc.parallelize(x)
y = rdd.map(lambda t: t[1]).sum()  # take the value from each (key, value) pair, then sum

In both cases the sum is 7.
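As a side note (an addition, not part of the original answer): since this is a pair RDD, Spark also exposes a values() method that drops the keys, which reads a little more directly than indexing into the tuple. A minimal PySpark sketch, assuming the usual SparkContext is available as sc:

x = [(u'X45', 2), (u'W80', 1), (u'F03', 2), (u'X61', 2)]
rdd = sc.parallelize(x)

# values() keeps only the value of each (key, value) pair
total = rdd.values().sum()  # 7

Either way the summation happens in a distributed fashion and only the final scalar comes back to the driver, so no collect() is needed.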

Upvotes: 10
