Reputation: 1593
val df = sc.parallelize(Seq((201601, "a"),
(201602, "b"),
(201603, "c"),
(201604, "c"),
(201607, "c"),
(201604, "c"),
(201608, "c"),
(201609, "c"),
(201605, "b"))).toDF("col1", "col2")
I want to get the top 3 values of col1. Can anyone please suggest a good way to do this?
Spark: 1.6.2, Scala: 2.10
Upvotes: 1
Views: 10992
Reputation: 9
Another way to get the largest values is the RDD top function, which returns the top n elements in descending order.
Example:
val data = sc.parallelize(Seq(("maths", 52), ("english", 75), ("science", 82), ("computer", 65), ("maths", 85))).top(2)
Results:
(science,82)
(maths,85)
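Note that the top call above ranks the tuples by their natural ordering, which compares the subject name first and the score second. That is why (science,82) comes before (maths,85). RDD.top also accepts an explicit Ordering. The sketch below uses a local Seq instead of an RDD (an assumption, so it runs in a plain Scala REPL without Spark), and topLocal is a hypothetical stand-in for RDD.top, not a Spark function.

```scala
val scores = Seq(("maths", 52), ("english", 75), ("science", 82),
  ("computer", 65), ("maths", 85))

// Hypothetical local stand-in for RDD.top(n)(ord):
// the n largest elements under ord, in descending order.
def topLocal[T](xs: Seq[T], n: Int)(implicit ord: Ordering[T]): Seq[T] =
  xs.sorted(ord.reverse).take(n)

// Natural tuple ordering: subject name first, then score.
val byName = topLocal(scores, 2)

// Explicit ordering on the score only.
val byScore = topLocal(scores, 2)(Ordering.by[(String, Int), Int](_._2))
```

On an actual RDD the equivalent call would be rdd.top(2)(Ordering.by[(String, Int), Int](_._2)).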
Upvotes: 0
Reputation: 214957
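First extract the maximum date, then keep only the rows that fall within the three months up to it:
import org.apache.spark.sql.functions.max
val maxDate = df.agg(max("col1")).first().getAs[Int](0)
// maxDate: Int = 201609
// Subtract three months from a yyyyMM date, rolling back the year when needed
def minusThree(date: Int): Int = {
  var year = date / 100
  var month = date % 100
  if (month <= 3) {
    year -= 1
    month += 9
  } else {
    month -= 3
  }
  year * 100 + month
}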
df.filter($"col1" > minusThree(maxDate)).show
+------+----+
|  col1|col2|
+------+----+
|201607|   c|
|201608|   c|
|201609|   c|
+------+----+
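This approach keeps the rows from the three months up to the maximum date, so it returns fewer than three distinct values if some of those months are missing from the data. If a window other than three months is needed, the date arithmetic can be generalized. The following is a minimal sketch in plain Scala; minusMonths is a hypothetical helper name, and it assumes col1 always holds a valid yyyyMM integer.

```scala
// Hypothetical helper: subtract n months from a yyyyMM integer.
def minusMonths(yyyymm: Int, n: Int): Int = {
  // Convert to a zero-based month count, then shift back by n months.
  val total = (yyyymm / 100) * 12 + (yyyymm % 100 - 1) - n
  // Convert back to yyyyMM with a one-based month.
  (total / 12) * 100 + (total % 12 + 1)
}
```

With this helper, the filter would read df.filter($"col1" > minusMonths(maxDate, 3)).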
Upvotes: 1
Reputation: 15297
You can sort col1 in descending order and take the first three rows:
df.select($"col1").orderBy($"col1".desc).limit(3).show()
This prints:
+------+
|  col1|
+------+
|201609|
|201608|
|201607|
+------+
Upvotes: 4