Reputation: 977
I have a Spark RDD as follows:
rdd = sc.parallelize([('X01','Y01'),
                      ('X01','Y02'),
                      ('X01','Y03'),
                      ('X02','Y01'),
                      ('X02','Y06')])
I would like to convert it into the following format:
[('X01',('Y01','Y02','Y03')),
('X02',('Y01','Y06'))]
Can someone help me achieve this using PySpark?
Upvotes: 0
Views: 355
Reputation: 1
As septra said, the groupByKey() method is what you need. Further, if you want to apply an operation to all the values of a particular key, you can do so with the mapValues() method. This method takes a function (the logic you want to apply to the grouped values) and applies it to the grouped values of each key. If you want both operations in one go, you can go for the reduceByKey() method, which merges the values of each key with a binary function; you can think of it roughly as groupByKey() followed by a per-key reduction, done more efficiently because values are combined within each partition before the shuffle.
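For concreteness, a minimal sketch of both approaches on the rdd from the question (wrapping each value in a one-element tuple for the reduceByKey() version is an assumption about how the values should be combined):
grouped = rdd.groupByKey().mapValues(tuple)
reduced = rdd.mapValues(lambda v: (v,)).reduceByKey(lambda a, b: a + b)  # concatenate per-key tuples
grouped.collect()  # [('X01', ('Y01', 'Y02', 'Y03')), ('X02', ('Y01', 'Y06'))], order not guaranteed
reduced.collect()  # same pairs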
Upvotes: 0
Reputation: 824
Convert the RDD to a PairRDD using mapToPair (with the key as the first column and the value as the rest of the record) and do a groupByKey on the resulting RDD.
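PySpark has no mapToPair (it belongs to the Java API); a plain map that returns (key, value) tuples plays the same role. A minimal sketch, assuming the key is the first field of each record:
pairs = rdd.map(lambda rec: (rec[0], rec[1:]))  # key = first column, value = rest of the record
pairs.groupByKey().mapValues(tuple).collect()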
Upvotes: 1
Reputation: 1753
A simple groupByKey operation is what you need:
rdd.groupByKey().mapValues(tuple).collect()
Result: [('X02', ('Y01', 'Y06')), ('X01', ('Y01', 'Y02', 'Y03'))]
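Note that groupByKey makes no ordering guarantee, which is why 'X02' comes first here; if a deterministic order matters, a sortByKey() can be chained in before collecting:
rdd.groupByKey().mapValues(tuple).sortByKey().collect()
# [('X01', ('Y01', 'Y02', 'Y03')), ('X02', ('Y01', 'Y06'))]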
Upvotes: 1