Reputation: 3063
let's say I have the following data Frame :
val a=Seq(("aa","b",1),("aa","c",5),("aa","d",0),("xx","y",5),("z","zz",9),("z","b",12)).toDF("name","tag","num").show
+----+---+---+
|name|tag|num|
+----+---+---+
| aa| b| 1|
| aa| c| 5|
| aa| d| 0|
| xx| y| 5|
| z| zz| 9|
| z| b| 12|
+----+---+---+
I want to filter this dataFrame so that :
for each group of data (grouped by name) if the column tag have the value 'b' I'll take the max value of the num column , otherwise I ignore the row
here is the out put that I want to have :
+----+---+---+
|name|tag|num|
+----+---+---+
| aa| c| 5|
| z| b| 12|
+----+---+---+
Explanation
Upvotes: 0
Views: 194
Reputation: 18098
Try this:
val df=Seq(("aa","b",1),("aa","c",5),("aa","d",0),("xx","y",5),("z","zz",9),("z","b",12)).toDF("name","tag","num")
df.createOrReplaceTempView("tab")
val res = spark.sql(""" with tw as (select t1.name, max(t1.num) as max_val
from tab t1
where t1.name in (select distinct t2.name
from tab t2
where t2.tag = 'b'
)
group by t1.name )
select distinct tz.name, tz.tag, tz.num
from tab tz, tw
where tz.name = tw.name
and tz.num = tw.max_val
""")
res.show(false)
Upvotes: 1