Reshaping Spark RDD

I have a Spark RDD as follows:

rdd = sc.parallelize([('X01','Y01'),
                      ('X01','Y02'),
                      ('X01','Y03'),
                      ('X02','Y01'),
                      ('X02','Y06')])

I would like to convert it into the following format:

[('X01',('Y01','Y02','Y03')),
 ('X02',('Y01','Y06'))]

Can someone help me achieve this using PySpark?

Upvotes: 0

Views: 355

Answers (3)

Kapil Kumar

Reputation: 1

As septra said, the groupByKey method is what you need. If you then want to apply an operation to all the values for a particular key, you can do that with the mapValues() method: it takes a function (the logic you want to apply to the grouped values) and applies it to the group of values for each key. If you want the grouping and the combining in one go, you can use the reduceByKey() method, which behaves roughly like groupByKey() followed by reducing each group's values pairwise, but merges values within each partition before the shuffle, so it is usually more efficient.
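
A minimal sketch of both approaches, using the rdd from the question (output key order is not guaranteed):

# group the values per key, then transform each group with mapValues()
grouped = rdd.groupByKey().mapValues(tuple)

# reduceByKey() merges values pairwise with an associative function,
# combining within each partition before the shuffle; each value is
# wrapped in a one-element tuple so the function can concatenate them
merged = rdd.mapValues(lambda v: (v,)).reduceByKey(lambda a, b: a + b)

# both yield e.g. [('X01', ('Y01', 'Y02', 'Y03')), ('X02', ('Y01', 'Y06'))]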

Upvotes: 0

Aviral Kumar

Reputation: 824

Convert the RDD to a pair RDD, with the first column as the key and the rest of the record as the value, and do a groupByKey on the resulting RDD. (mapToPair is the Java API for this; in PySpark, an RDD of 2-tuples is already treated as a pair RDD.)
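
A minimal PySpark sketch of that idea, assuming hypothetical comma-separated records:

# hypothetical raw records: the first column becomes the key,
# the rest of the record becomes the value
lines = sc.parallelize(['X01,Y01', 'X01,Y02', 'X02,Y06'])
pairs = lines.map(lambda line: tuple(line.split(',', 1)))
result = pairs.groupByKey().mapValues(tuple).collect()
# e.g. [('X01', ('Y01', 'Y02')), ('X02', ('Y06',))]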

Upvotes: 1

zenofsahil

Reputation: 1753

A simple groupByKey operation is what you need.

rdd.groupByKey().mapValues(tuple).collect()

Result: [('X02', ('Y01', 'Y06')), ('X01', ('Y01', 'Y02', 'Y03'))] (key order after the shuffle is not guaranteed, so it may differ from the order in your expected output)

Upvotes: 1
