Baptiste Arnaud

Reputation: 2760

Join specific rows in an RDD

I have an RDD like this:

[('anger', 166),
 ('lyon', 193),
 ('marseilles_1', 284),
 ('nice', 203),
 ('paris_2', 642),
 ('paris_3', 330),
 ('troyes', 214),
 ('marseilles_2', 231),
 ('nantes', 207),
 ('orlean', 196),
 ('paris_1', 596),
 ('rennes', 180),
 ('toulouse', 177)]

I need to merge paris_1, paris_2, paris_3 into one row called paris.

I have absolutely no idea how to proceed and couldn't find any answers.

Can you help me?

Upvotes: 0

Views: 56

Answers (1)

MaFF

Reputation: 10096

You can use a regular expression to get city names from your current key values, then reduce by key:

import re

# Strip underscores and digits from each key, then sum the values per city.
rdd\
    .map(lambda l: (re.sub('[_0-9]', '', l[0]), l[1]))\
    .reduceByKey(lambda x, y: x + y)\
    .collect()

    [('anger', 166),
     ('lyon', 193),
     ('nice', 203),
     ('paris', 1568),
     ('troyes', 214),
     ('marseilles', 515),
     ('nantes', 207),
     ('orlean', 196),
     ('rennes', 180),
     ('toulouse', 177)]
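
If you want to check the totals without a Spark session, the same two steps (normalize the key, then sum per key) can be sketched in plain Python. This is only an illustration: `city_name` and `merge_by_city` are hypothetical helper names, and the anchored pattern `_\d+$` is an assumption that a suffix is always an underscore followed by digits at the end of the key. It is a bit stricter than `[_0-9]`, which would also strip digits or underscores in the middle of a name.

```python
import re
from collections import defaultdict

# Sample data copied from the question.
rows = [('anger', 166), ('lyon', 193), ('marseilles_1', 284), ('nice', 203),
        ('paris_2', 642), ('paris_3', 330), ('troyes', 214),
        ('marseilles_2', 231), ('nantes', 207), ('orlean', 196),
        ('paris_1', 596), ('rennes', 180), ('toulouse', 177)]


def city_name(key):
    # Assumption: the suffix is "_" followed by digits at the very end of the key.
    return re.sub(r'_\d+$', '', key)


def merge_by_city(pairs):
    # Plain-Python stand-in for map + reduceByKey: sum the values per normalized key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[city_name(key)] += value
    return dict(totals)


totals = merge_by_city(rows)
print(totals['paris'], totals['marseilles'])  # prints "1568 515"
```

In PySpark, the same helper could be used as `rdd.map(lambda kv: (city_name(kv[0]), kv[1])).reduceByKey(lambda x, y: x + y)`. Note that the order of the collected result is not guaranteed.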

Upvotes: 2
