Curtis Chong

Reputation: 811

How to Sum values of Column Within RDD

I have an RDD with the following rows:

[(id,value)]

How would you sum the values of all rows in the RDD?

Upvotes: 1

Views: 21138

Answers (1)

OneCricketeer

Reputation: 191728

Simply use sum(); you just need to get the values into an RDD of plain numbers first.

For example:

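# Parentheses allow line breaks and inline comments; a comment after a trailing backslash is a syntax error
(sc.parallelize([('id', [1, 2, 3]), ('id2', [3, 4, 5])])
    .flatMap(lambda tup: tup[1])  # flattens to [1, 2, 3, 3, 4, 5]
    .sum())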

Outputs 18

Similarly, just use values() to get that second column as an RDD on its own.

# Same pattern for scalar values: drop the keys, then sum
(sc.parallelize([('id', 6), ('id2', 12)])
    .values()  # [6, 12]
    .sum())

Upvotes: 5
