Reputation: 198
I need to check if any of the columns of the dataframe are empty. Empty can be defined as if all rows of the column has values that are either null or empty string
Dataframe is as follows
+---+-------+-------+-------+-------+
| ID|Sample1|Sample2|Sample3|Sample4|
+---+-------+-------+-------+-------+
| 1| a1| b1| c1| null|
| 2| | | | |
| 3| a3| | | |
+---+-------+-------+-------+-------+
The code I use for the check
mainDF.select(mainDF.columns.map(c => sum((col(c).isNotNull && col(c)!="").cast("int")).alias(c)): _*).show()
What I get is
+---+-------+-------+-------+-------+
| ID|Sample1|Sample2|Sample3|Sample4|
+---+-------+-------+-------+-------+
| 3| 3| 3| 3| 2|
+---+-------+-------+-------+-------+
What I hope to get is
+---+-------+-------+-------+-------+
| ID|Sample1|Sample2|Sample3|Sample4|
+---+-------+-------+-------+-------+
| 3| 2| 1| 1| 0|
+---+-------+-------+-------+-------+
Also, My final results should be a true
or false
on if anyone of the columns is empty.
In this case it will be true
because count of Sample4
is 0
Upvotes: 0
Views: 49
Reputation: 2451
You could map correct values to 1 and empty strings/nulls to 0 and then perform sum
.
mainDF.select(mainDF.columns.map(c => sum(when(col(c)==="" or col(c).isNull,0).otherwise(1)).as(c)):_*).show()
+---+-------+-------+-------+-------+
| ID|Sample1|Sample2|Sample3|Sample4|
+---+-------+-------+-------+-------+
| 3| 2| 1| 1| 0|
+---+-------+-------+-------+-------+
Upvotes: 1