lagunazul

Reputation: 277

PySpark - sortByKey() method to return values from k,v pairs in their original order

I need to be able to return a list of values from (key,value) pairs from an RDD while maintaining original order.

I've included my workaround below but I'd like to be able to do it all in one go.

Something like:

myRDD = [(1, 2582), (3, 3222), (4, 4190), (5, 2502), (6, 2537)]
values = myRDD.<insert PySpark method(s)>
print(values)
>>>[2582, 3222, 4190, 2502, 2537]

My workaround:

myRDD = [(1, 2582), (3, 3222), (4, 4190), (5, 2502), (6, 2537)]

values = []
for item in myRDD.sortByKey(True).collect():
    values.append(item[1])
print(values)
>>>[2582, 3222, 4190, 2502, 2537]

Thanks!

Upvotes: 2

Views: 17649

Answers (1)

zero323

Reputation: 330093

If by "original order" you mean order of the keys then all you have to do is add map after the sort:

myRDD.sortByKey(ascending=True).map(lambda kv: kv[1]).collect()

or call the values() method:

myRDD.sortByKey(ascending=True).values().collect()
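
For context, a minimal end-to-end sketch (assuming an active SparkContext named sc, which the question does not show):

pairs = [(1, 2582), (3, 3222), (4, 4190), (5, 2502), (6, 2537)]
myRDD = sc.parallelize(pairs)

# Sort by key, drop the keys, and bring the values back to the driver.
values = myRDD.sortByKey(ascending=True).values().collect()
print(values)
# [2582, 3222, 4190, 2502, 2537]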

If you refer to the order of the values in the structure that was used to create the initial RDD, then it is impossible without storing additional information. RDDs are unordered unless you explicitly apply transformations like sortBy.
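
For example, one way to store that information is to attach a position with zipWithIndex() when the RDD is created and sort by it later. A sketch, again assuming a SparkContext named sc:

pairs = [(1, 2582), (3, 3222), (4, 4190), (5, 2502), (6, 2537)]
indexed = sc.parallelize(pairs).zipWithIndex()   # ((key, value), position)

# ... arbitrary transformations that may reorder the data ...

restored = (indexed
            .sortBy(lambda x: x[1])      # sort back by the stored position
            .map(lambda x: x[0][1])      # keep only the value
            .collect())
print(restored)
# [2582, 3222, 4190, 2502, 2537]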

Upvotes: 10
