Reputation: 277
I need to return a list of the values from the (key, value) pairs in an RDD while maintaining the original order.
I've included my workaround below, but I'd like to be able to do it all in one go.
Something like:
myRDD = sc.parallelize([(1, 2582), (3, 3222), (4, 4190), (5, 2502), (6, 2537)])
values = myRDD.<insert PySpark method(s)>
print values
>>>[2582, 3222, 4190, 2502, 2537]
My workaround:
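# sc is an existing SparkContext
myRDD = sc.parallelize([(1, 2582), (3, 3222), (4, 4190), (5, 2502), (6, 2537)])
values = []
# sort by key, bring the pairs to the driver, and keep only the values
for item in myRDD.sortByKey(True).collect():
    values.append(item[1])
print(values)
>>>[2582, 3222, 4190, 2502, 2537]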
Thanks!
Upvotes: 2
Views: 17649
Reputation: 330093
If by "original order" you mean the order of the keys, then all you have to do is add a map after the sort:
myRDD.sortByKey(ascending=True).map(lambda kv: kv[1]).collect()
or call the values method:
myRDD.sortByKey(ascending=True).values().collect()
If you mean the order of the values in the structure that was used to create the initial RDD, then it is impossible without storing additional information. RDDs are unordered unless you explicitly apply transformations like sortBy.
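A minimal sketch of what "storing additional information" could look like, assuming the position is attached with zipWithIndex right after the RDD is created, before any shuffling transformation. The helper names tag_with_position and values_in_original_order are hypothetical, not part of PySpark:

```python
def tag_with_position(rdd):
    """Turn each (k, v) into (position, (k, v)) so the creation order can be restored later.

    Assumes `rdd` is a PySpark RDD of (key, value) pairs; zipWithIndex yields (element, index).
    """
    return rdd.zipWithIndex().map(lambda pair: (pair[1], pair[0]))


def values_in_original_order(tagged_rdd):
    """Sort by the stored position and return only the values as a Python list."""
    return tagged_rdd.sortByKey(ascending=True).map(lambda pair: pair[1][1]).collect()
```

With an existing SparkContext sc (an assumption here), this would be used as tagged = tag_with_position(sc.parallelize(pairs)), followed later by values_in_original_order(tagged). The extra index costs one more field per record, and collect() still pulls every value to the driver, so it only suits results that fit in driver memory.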
Upvotes: 10