Reputation: 113
So, i have data looking like this
id | tag | flag | num
1 | tag_1 | Y | 100
1 | tag_2 | N | 200
2 | tag_3 | N | 100
3 | tag_4 | N | 300
3 | tag_5 | Y | 200
i need to group by id
, sum the num
, and if the row is more than 1, i want to choose data from row with flag = Y
, otherwise use whatever row available. so the result is looking like this
id | tag | flag | num
1 | tag_1 | Y | 300
2 | tag_3 | N | 100
3 | tag_5 | Y | 500
there's only one Y
and N
for each group of id
, so there will never be 2 rows or more with flag Y
with same id.
is there anyway to do this in spark scala?
Upvotes: 0
Views: 35
Reputation: 10092
As you have suggested that only 1 row will have the flag set to Y
. The following should work:
val df = Seq(
(1 ,"tag_1","Y",100),
(1 ,"tag_2","N",200),
(2 ,"tag_3","N",100),
(3 ,"tag_4","N",300),
(3 ,"tag_5","Y",200)).toDF("id", "tag", "flag", "num")
df.show
//+---+-----+----+---+
//| id| tag|flag|num|
//+---+-----+----+---+
//| 1|tag_1| Y|100|
//| 1|tag_2| N|200|
//| 2|tag_3| N|100|
//| 3|tag_4| N|300|
//| 3|tag_5| Y|200|
//+---+-----+----+---+
val joined = df.as("l").join(
df.groupBy($"id").agg(sum($"num").as("num"), count($"*").as("cnt")).as("r"),
$"l.id" === $"r.id")
.select($"l.id", $"l.tag", $"l.flag", $"r.num", $"r.cnt")
//+---+-----+----+---+---+
//|id |tag |flag|num|cnt|
//+---+-----+----+---+---+
//|1 |tag_1|Y |300|2 |
//|1 |tag_2|N |300|2 |
//|2 |tag_3|N |100|1 |
//|3 |tag_4|N |500|2 |
//|3 |tag_5|Y |500|2 |
//+---+-----+----+---+---+
joined.withColumn("filtered",
when($"cnt" > lit(1) && $"flag" === lit("Y"), lit("y"))
.when($"cnt" > lit(1) && $"flag" === lit("N"), lit("n"))
.otherwise(lit("y"))).where($"filtered" === lit("y"))
.drop("filtered", "cnt")
//+---+-----+----+---+
//| id| tag|flag|num|
//+---+-----+----+---+
//| 1|tag_1| Y|300|
//| 2|tag_3| N|100|
//| 3|tag_5| Y|500|
//+---+-----+----+---+
Upvotes: 1