Spark: Row filter based on Column value

Question

I have millions of rows as dataframe like this:

val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"), ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE")).toDF("id", "status")

scala> df.show(false)
+---+--------+
|id |status  |
+---+--------+
|id1|ACTIVE  |
|id1|INACTIVE|
|id1|INACTIVE|
|id2|ACTIVE  |
|id3|INACTIVE|
|id3|INACTIVE|
+---+--------+

Now I want to divide this data into three separate dataFrame like this:

Only ACTIVE ids (like id2), say activeDF
Only INACTIVE ids (like id3), say inactiveDF
Having both ACTIVE and INACTIVE as status, say bothDF

How can I calculate activeDF and inactiveDF?

I know that bothDF can be calculated like

df.select("id").distinct.except(activeDF).except(inactiveDF)

, but this will involve shuffling (as 'distinct' operation required same). Is there any better way to calculate bothDF

Versions:

Spark : 2.2.1
Scala : 2.11

C.S.Reddy Gadipally · Accepted Answer

just another way - groupBy, collect as set and then if the size of the set is 1, it is either active or inactive only, else both

scala> val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"), ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE"), ("id4", "ACTIVE"), ("id5", "ACTIVE"), ("id6", "INACTIVE"), ("id7", "ACTIVE"), ("id7", "INACTIVE")).toDF("id", "status")
df: org.apache.spark.sql.DataFrame = [id: string, status: string]

scala> df.show(false)
+---+--------+
|id |status  |
+---+--------+
|id1|ACTIVE  |
|id1|INACTIVE|
|id1|INACTIVE|
|id2|ACTIVE  |
|id3|INACTIVE|
|id3|INACTIVE|
|id4|ACTIVE  |
|id5|ACTIVE  |
|id6|INACTIVE|
|id7|ACTIVE  |
|id7|INACTIVE|
+---+--------+


scala> val allstatusDF = df.groupBy("id").agg(collect_set("status") as "allstatus")
allstatusDF: org.apache.spark.sql.DataFrame = [id: string, allstatus: array]

scala> allstatusDF.show(false)
+---+------------------+
|id |allstatus         |
+---+------------------+
|id7|[ACTIVE, INACTIVE]|
|id3|[INACTIVE]        |
|id5|[ACTIVE]          |
|id6|[INACTIVE]        |
|id1|[ACTIVE, INACTIVE]|
|id2|[ACTIVE]          |
|id4|[ACTIVE]          |
+---+------------------+


scala> allstatusDF.withColumn("status", when(size($"allstatus") === 1, $"allstatus".getItem(0)).otherwise("BOTH")).show(false)
+---+------------------+--------+
|id |allstatus         |status  |
+---+------------------+--------+
|id7|[ACTIVE, INACTIVE]|BOTH    |
|id3|[INACTIVE]        |INACTIVE|
|id5|[ACTIVE]          |ACTIVE  |
|id6|[INACTIVE]        |INACTIVE|
|id1|[ACTIVE, INACTIVE]|BOTH    |
|id2|[ACTIVE]          |ACTIVE  |
|id4|[ACTIVE]          |ACTIVE  |
+---+------------------+--------+

Spark: Row filter based on Column value

Answers (2)

Related Questions