Mass17

Reputation: 1605

RDD operation to sort values in pyspark

I have a file in the following format:

0, Alpha,-3.9, 4, 2001-02-01, 5, 20
0, Beta,-3.8, 3, 2002-02-01, 6, 21
1, Gamma,-3.7, 8, 2003-02-01, 7, 22
0, Alpha,-3.5, 4, 2004-02-01, 8, 23
0, Alpha,-3.9, 4, 2005-02-01, 8, 27

I want to take the distinct names (field index 1) from the lines and sort them by the value at field index 3, using only RDD operations. I would like to get the following output:

(Beta, 3)
(Alpha, 4)
(Gamma, 8)

This is what I have so far:

rdd = sc.textFile(myDataset)
list_ = rdd.map(lambda line: line.split(",")).map(lambda e : e[1]).distinct().collect() 
new_ = list_.sortBy(lambda e : e[2])

But I could not get the sort I wanted. Could anyone tell me how to do this using only RDD-based operations?

Upvotes: 0

Views: 519

Answers (1)

cozek

Reputation: 755

rdd = sc.textFile(myDataset) is correct.

list_ = rdd.map(lambda line: line.split(",")).map(lambda e : e[1]).distinct().collect() 
new_ = list_.sortBy(lambda e : e[2]) # fails: list_ is a plain Python list, and each e is only a name string.

You already called collect() on list_, so it is a plain Python list rather than an RDD, and calling sortBy on it fails because lists have no such method. Perhaps that slipped in while posting. The main issue is the map operation: it keeps only e[1], the name, so the value you want to sort by is gone. You need an RDD of (name, value) pairs instead. See below.
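To illustrate the point without a Spark session, here is a minimal plain-Python sketch. The list `names` is a hand-written stand-in for what collect() returns in the question's code; it is an assumption for illustration, not output captured from Spark.

```python
# Stand-in for the collected result: collect() hands back an ordinary Python list.
names = [" Alpha", " Beta", " Gamma"]

# A Python list has no sortBy method, so list_.sortBy(...) raises AttributeError.
print(hasattr(names, "sortBy"))

# Each element is just a name string, so e[2] is its third character,
# not the numeric field from the original line.
print(names[0][2])
```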

>>> rdd.map(lambda line: line.split(",")).map(lambda e : e[1]).collect()
[' Alpha', ' Beta', ' Gamma', ' Alpha', ' Alpha']

The result above contains only the names; the value you want to sort by has been dropped, so calling distinct() on it cannot help. Instead, keep both fields:

>>> list_ = rdd.map(lambda line: line.split(",")).map(lambda e : (e[1],e[3]))
>>> list_.collect()
[(' Alpha', ' 4'),
 (' Beta', ' 3'),
 (' Gamma', ' 8'),
 (' Alpha', ' 4'),
 (' Alpha', ' 4')]
>>> distinct_rdd = list_.distinct()  # drop duplicate (name, value) pairs
>>> distinct_rdd.collect()
[(' Alpha', ' 4'), (' Beta', ' 3'), (' Gamma', ' 8')]

Now that we have an RDD of (name, value) pairs, we can sort it by the second element of each pair.

>>> sorted_rdd = distinct_rdd.sortBy(lambda x: x[1])
>>> sorted_rdd.collect()
[(' Beta', ' 3'), (' Alpha', ' 4'), (' Gamma', ' 8')]
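One caveat about the output above: the fields are still strings with a leading space, so the sort is lexicographic and a value such as ' 10' would land before ' 3'. A possible refinement, sketched below, strips the whitespace and converts the value to an integer before sorting. The names `SAMPLE_LINES`, `parse_line` and `distinct_names_sorted_by_value` are hypothetical helpers introduced here, not part of the question or of PySpark; only `map`, `distinct` and `sortBy` are RDD methods.

```python
# Sample rows copied from the question; normally they would come from sc.textFile(myDataset).
SAMPLE_LINES = [
    "0, Alpha,-3.9, 4, 2001-02-01, 5, 20",
    "0, Beta,-3.8, 3, 2002-02-01, 6, 21",
    "1, Gamma,-3.7, 8, 2003-02-01, 7, 22",
    "0, Alpha,-3.5, 4, 2004-02-01, 8, 23",
    "0, Alpha,-3.9, 4, 2005-02-01, 8, 27",
]


def parse_line(line):
    """Hypothetical helper: return (name, value) with whitespace stripped and value as int."""
    fields = [field.strip() for field in line.split(",")]
    return fields[1], int(fields[3])


def distinct_names_sorted_by_value(rdd):
    """Hypothetical helper: distinct (name, value) pairs, ascending by numeric value."""
    return rdd.map(parse_line).distinct().sortBy(lambda pair: pair[1])


# Usage, assuming an existing SparkContext named sc:
#   distinct_names_sorted_by_value(sc.parallelize(SAMPLE_LINES)).collect()
```

Keeping the parsing in a separate function means it can be checked in plain Python before it is handed to Spark. If a name could appear with more than one value, distinct() would keep every such pair, so some reduction step would be needed first; the sample data does not say which value should win in that case.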

Upvotes: 1
