Choosing one row using a flag from group by

Question

So, i have data looking like this

  id   |   tag   | flag |  num
   1   |  tag_1  |   Y  |  100
   1   |  tag_2  |   N  |  200
   2   |  tag_3  |   N  |  100
   3   |  tag_4  |   N  |  300
   3   |  tag_5  |   Y  |  200

i need to group by id, sum the num, and if the row is more than 1, i want to choose data from row with flag = Y, otherwise use whatever row available. so the result is looking like this

  id   |   tag   | flag |  num
   1   |  tag_1  |   Y  |  300
   2   |  tag_3  |   N  |  100
   3   |  tag_5  |   Y  |  500

there's only one Y and N for each group of id, so there will never be 2 rows or more with flag Y with same id.

is there anyway to do this in spark scala?

philantrovert · Accepted Answer

As you have suggested that only 1 row will have the flag set to Y. The following should work:

val df = Seq(
(1 ,"tag_1","Y",100),
(1 ,"tag_2","N",200),
(2 ,"tag_3","N",100),
(3 ,"tag_4","N",300),
(3 ,"tag_5","Y",200)).toDF("id", "tag", "flag", "num")

 df.show
//+---+-----+----+---+
//| id|  tag|flag|num|
//+---+-----+----+---+
//|  1|tag_1|   Y|100|
//|  1|tag_2|   N|200|
//|  2|tag_3|   N|100|
//|  3|tag_4|   N|300|
//|  3|tag_5|   Y|200|
//+---+-----+----+---+


val joined = df.as("l").join(
                df.groupBy($"id").agg(sum($"num").as("num"), count($"*").as("cnt")).as("r"),
                $"l.id" === $"r.id")
               .select($"l.id", $"l.tag", $"l.flag", $"r.num", $"r.cnt")

//+---+-----+----+---+---+
//|id |tag  |flag|num|cnt|
//+---+-----+----+---+---+
//|1  |tag_1|Y   |300|2  |
//|1  |tag_2|N   |300|2  |
//|2  |tag_3|N   |100|1  |
//|3  |tag_4|N   |500|2  |
//|3  |tag_5|Y   |500|2  |
//+---+-----+----+---+---+

joined.withColumn("filtered", 
    when($"cnt" > lit(1) && $"flag" === lit("Y"), lit("y"))
    .when($"cnt" > lit(1) && $"flag" === lit("N"), lit("n"))
    .otherwise(lit("y"))).where($"filtered" === lit("y"))
    .drop("filtered", "cnt")
    
//+---+-----+----+---+
//| id|  tag|flag|num|
//+---+-----+----+---+
//|  1|tag_1|   Y|300|
//|  2|tag_3|   N|100|
//|  3|tag_5|   Y|500|
//+---+-----+----+---+

Choosing one row using a flag from group by

Answers (1)

Related Questions