Reputation: 77
I have dataframe as follows:
Tags    Place       Count
Sales   New Jersey  200
Sales   Hong Kong   200
Sales   Florida     200
Trade   New York    150
Trade   San Jose    150
Trade   New Jersey  150
Market  New Jersey  50
Market  Michigan    50
Market  Denver      50
As you can see, the rows are already sorted by "Count" in this table. I want to get the first n rows from each group, where the grouping column is "Tags".
Say n = 2; then the resulting dataframe should look like this:
Tags    Place       Count
Sales   New Jersey  200
Sales   Hong Kong   200
Trade   New York    150
Trade   San Jose    150
Market  New Jersey  50
Market  Michigan    50
How can I do this in Spark SQL?
Upvotes: 2
Views: 1360
Reputation: 1409
You need to create the DataFrame using a HiveContext, since in Spark 1.x window functions require Hive support. Note the ORDER BY inside the window spec: ROW_NUMBER() needs an ordering to decide which rows come first in each partition. (`data` is your RDD of rows and `schema` the matching StructType.)
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val df = hiveContext.createDataFrame(data, schema)
df.registerTempTable("df")

hiveContext.sql("""
  SELECT tag, place, count
  FROM (
    SELECT tag, place, count,
           ROW_NUMBER() OVER (PARTITION BY tag ORDER BY count DESC) AS rank
    FROM df
  ) tmp
  WHERE rank <= 2
  ORDER BY count DESC
""").show(false)
Or, equivalently, with the DataFrame API:
import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.expressions.Window

// Partition by tag and order within each partition, as row_number() requires an ordered window
val w = Window.partitionBy(df("tag")).orderBy(df("count").desc)
val rankDesc = row_number().over(w).alias("rank")

df.select($"*", rankDesc).filter($"rank" <= 2).orderBy($"count".desc).drop("rank").show(false)
Output:
+------+----------+-----+
|tag   |place     |count|
+------+----------+-----+
|Sales |Hong Kong |200  |
|Sales |New Jersey|200  |
|Trade |New York  |150  |
|Trade |San Jose  |150  |
|Market|New Jersey|50   |
|Market|Michigan  |50   |
+------+----------+-----+
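To see what the window query computes without a Spark cluster, here is a minimal sketch of the same per-group take-n logic on plain Scala collections (the `Row` case class and `topNPerGroup` helper are illustrative names, not part of Spark):

```scala
// Mirrors ROW_NUMBER() OVER (PARTITION BY tag ORDER BY count DESC) <= n
case class Row(tag: String, place: String, count: Int)

val rows = Seq(
  Row("Sales", "New Jersey", 200), Row("Sales", "Hong Kong", 200), Row("Sales", "Florida", 200),
  Row("Trade", "New York", 150),   Row("Trade", "San Jose", 150),  Row("Trade", "New Jersey", 150),
  Row("Market", "New Jersey", 50), Row("Market", "Michigan", 50),  Row("Market", "Denver", 50)
)

def topNPerGroup(rows: Seq[Row], n: Int): Seq[Row] =
  rows.groupBy(_.tag)                       // PARTITION BY tag
      .values
      .flatMap(_.sortBy(-_.count).take(n))  // ORDER BY count DESC, keep first n per group
      .toSeq
      .sortBy(-_.count)                     // overall ORDER BY count DESC

topNPerGroup(rows, 2).foreach(println)
```

Rows that tie on `count` (e.g. the three Sales rows) can come back in any order within their group, just as with ROW_NUMBER over a non-unique ordering key.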
Upvotes: 1