Mohammad Derakhshan

Reputation: 1572

Sorting values of an array type in an RDD using PySpark

I have an RDD containing values like this:

[
   (Key1, ([2,1,4,3,5],5)),
   (Key2, ([6,4,3,5,2],5)),
   (Key3, ([14,12,13,10,15],5)),
]

and I need to sort the array inside each value, like this:

[
   (Key1, ([1,2,3,4,5],5)),
   (Key2, ([2,3,4,5,6],5)),
   (Key3, ([10,12,13,14,15],5)),
]

I found two sorting methods in Spark: sortBy and sortByKey. I tried the sortBy method like this:

myRDD.sortBy(lambda x: x[1][0])

But unfortunately, it sorts the records based on the first element of each array instead of sorting the elements within each array.

Also, sortByKey doesn't seem to help, because it only sorts the records by their keys.

How can I produce the sorted RDD?

Upvotes: 2

Views: 671

Answers (1)

Ged

Reputation: 18108

Try something like this:

# sort the inner list, keep the key and the count untouched
rdd2 = rdd.map(lambda x: (x[0], (sorted(x[1][0]), x[1][1])))
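
For context, a minimal runnable sketch of that approach, assuming string keys like "Key1" and a local SparkContext (both illustrative, not taken from the original post):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# sample data shaped like the question: (key, (list, count))
rdd = sc.parallelize([
    ("Key1", ([2, 1, 4, 3, 5], 5)),
    ("Key2", ([6, 4, 3, 5, 2], 5)),
    ("Key3", ([14, 12, 13, 10, 15], 5)),
])

# map over each record and sort only the list inside the value tuple
rdd2 = rdd.map(lambda x: (x[0], (sorted(x[1][0]), x[1][1])))

print(rdd2.collect())
# [('Key1', ([1, 2, 3, 4, 5], 5)),
#  ('Key2', ([2, 3, 4, 5, 6], 5)),
#  ('Key3', ([10, 12, 13, 14, 15], 5))]

The point is that sortBy/sortByKey reorder records across the RDD, whereas sorting within each array is a per-record transformation, so a plain map with Python's sorted is enough.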

Upvotes: 2
