aName
aName

Reputation: 3063

how to filter depending on a value after grouping by in spark

let's say I have the following data Frame :

val a=Seq(("aa","b",1),("aa","c",5),("aa","d",0),("xx","y",5),("z","zz",9),("z","b",12)).toDF("name","tag","num").show
+----+---+---+
|name|tag|num|
+----+---+---+
|  aa|  b|  1|
|  aa|  c|  5|
|  aa|  d|  0|
|  xx|  y|  5|
|   z| zz|  9|
|   z|  b| 12|
+----+---+---+

I want to filter this dataFrame so that :

for each group of data (grouped by name) if the column tag have the value 'b' I'll take the max value of the num column , otherwise I ignore the row

here is the out put that I want to have :

+----+---+---+
|name|tag|num|
+----+---+---+
|  aa|  c|  5|
|   z|  b| 12|
+----+---+---+

Explanation

Upvotes: 0

Views: 194

Answers (1)

Ged
Ged

Reputation: 18098

Try this:

val df=Seq(("aa","b",1),("aa","c",5),("aa","d",0),("xx","y",5),("z","zz",9),("z","b",12)).toDF("name","tag","num")
df.createOrReplaceTempView("tab")

val res = spark.sql(""" with tw as (select t1.name, max(t1.num) as max_val
                          from tab t1 
                         where t1.name in (select distinct t2.name 
                                             from tab t2
                                            where t2.tag = 'b'
                                          )
                      group by t1.name )
                      select distinct tz.name, tz.tag, tz.num
                        from tab tz, tw
                       where tz.name = tw.name
                         and tz.num  = tw.max_val
                   """) 
res.show(false)

Upvotes: 1

Related Questions