Ajay
Ajay

Reputation: 198

Check is anyone of the dataframe columns are empty

I need to check if any of the columns of the dataframe are empty. Empty can be defined as if all rows of the column has values that are either null or empty string

Dataframe is as follows

+---+-------+-------+-------+-------+
| ID|Sample1|Sample2|Sample3|Sample4|
+---+-------+-------+-------+-------+
|  1|     a1|     b1|     c1|   null|
|  2|       |       |       |       |
|  3|     a3|       |       |       |
+---+-------+-------+-------+-------+

The code I use for the check

mainDF.select(mainDF.columns.map(c => sum((col(c).isNotNull && col(c)!="").cast("int")).alias(c)): _*).show()

What I get is

+---+-------+-------+-------+-------+
| ID|Sample1|Sample2|Sample3|Sample4|
+---+-------+-------+-------+-------+
|  3|      3|      3|      3|      2|
+---+-------+-------+-------+-------+

What I hope to get is

+---+-------+-------+-------+-------+
| ID|Sample1|Sample2|Sample3|Sample4|
+---+-------+-------+-------+-------+
|  3|      2|      1|      1|      0|
+---+-------+-------+-------+-------+

Also, My final results should be a true or false on if anyone of the columns is empty. In this case it will be true because count of Sample4 is 0

Upvotes: 0

Views: 49

Answers (1)

chlebek
chlebek

Reputation: 2451

You could map correct values to 1 and empty strings/nulls to 0 and then perform sum.

mainDF.select(mainDF.columns.map(c => sum(when(col(c)==="" or col(c).isNull,0).otherwise(1)).as(c)):_*).show()

+---+-------+-------+-------+-------+
| ID|Sample1|Sample2|Sample3|Sample4|
+---+-------+-------+-------+-------+
|  3|      2|      1|      1|      0|
+---+-------+-------+-------+-------+

Upvotes: 1

Related Questions