AAG

Reputation: 33

How to append elements to a list by using reduceByKey in pyspark

I'm kind of stuck trying to solve a problem in pyspark. After doing some calculations with a map function, I have an RDD containing a list of dicts like this:

[{key1: tuple1}, {key1: tuple2}....{key2: tuple1}, {keyN: tupleN}] 

I want to build, for each key, a list of all the tuples that share that key, obtaining something like:

[{key1: [tuple1, tuple2, tuple3...]}, {key2: [tuple1, tuple2....]}] 

I think an example is more illustrative:

[{0: (0, 1.0)}, {0: (1, 0.0)}, {1: (0, 0.0)}, {1: (1, 1.0)}, {2:(0,0.0)}... ]

And I would like to obtain list of dicts like this:

[{0: [(0, 1.0), (1, 0.0)]}, {1: [(0, 0.0), (1, 1.0)]}, {2: [(0, 0.0), ...]}, ...]

I'm trying to avoid the "combineByKey" function because it takes too long. Is there any way to do this with "reduceByKey"?

Thank you all very much.

Upvotes: 1

Views: 1019

Answers (1)

BPL

Reputation: 9863

Here's a possible solution that doesn't use reduceByKey at all, just Python built-in functions:

from collections import defaultdict


inp = [{0: (0, 1.0)}, {0: (1, 0.0)}, {1: (0, 0.0)},
       {1: (1, 1.0)}, {2: (0, 0.0)}]

# Collect every tuple under its key.
out = defaultdict(list)

for d in inp:
    for k, v in d.items():
        out[k].append(v)

# Rebuild the one-dict-per-key structure from the question.
out = [{k: v} for k, v in out.items()]
print(out)
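If you do want to stay inside Spark, a minimal sketch with reduceByKey could look like the following (assuming a SparkContext named sc is already available). The idea is to flatten each single-entry dict into a (key, tuple) pair, wrap the tuple in a list, and let reduceByKey concatenate the lists per key. Note that reduceByKey is itself built on top of combineByKey, so this is unlikely to be faster; it just answers the question as asked.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize([{0: (0, 1.0)}, {0: (1, 0.0)}, {1: (0, 0.0)},
                      {1: (1, 1.0)}, {2: (0, 0.0)}])

result = (rdd
          .flatMap(lambda d: d.items())      # {k: tup} -> (k, tup)
          .mapValues(lambda tup: [tup])      # (k, tup) -> (k, [tup])
          .reduceByKey(lambda a, b: a + b)   # concatenate the lists per key
          .map(lambda kv: {kv[0]: kv[1]}))   # back to one dict per key

print(result.collect())
# e.g. [{0: [(0, 1.0), (1, 0.0)]}, {1: [(0, 0.0), (1, 1.0)]}, {2: [(0, 0.0)]}]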

Upvotes: 1
