Reputation: 21
I am trying to get the list of columns from the DataFrame below in which every value is null or 0.
+-----+-----+-----+-----+-----+
| Col1| Col2| Col3| Col4| Col5|
+-----+-----+-----+-----+-----+
|Jamer| null|    M|    0| TW18|
|XYZAM| null|    F|    0|    0|
|Nowra| null| null|    0| null|
+-----+-----+-----+-----+-----+
Expected output: Col2 & Col4
If I use the PySpark Column class's isNull, the result also includes Col3 & Col5, which is not expected. Any suggestions?
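For context, a null-only check along these lines (an illustrative sketch of such a check) flags every column containing at least one null, which is why Col3 & Col5 also show up while the all-zero Col4 is missed:

from pyspark.sql.functions import col

# flags columns containing ANY null -- catches Col2, Col3, Col5,
# but misses Col4, whose values are all 0 rather than null
null_cols = [c for c in df.columns if df.where(col(c).isNull()).count() > 0]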
Upvotes: 1
Views: 1374
Reputation: 798
I would count the nulls and compare the result to the total number of rows:
from pyspark.sql.functions import col, count, isnan, when

counted_nulls = df.select([
    (count(when(isnan(c) | col(c).isNull(), c)) == df.count()).alias(c)
    for c in df.columns
])
Then, you can get the columns where all values are null like this:
[k for k, v in counted_nulls.collect()[0].asDict().items() if v is True]
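Note that this counts only nulls and NaNs; the question also treats 0 as missing. A sketch extending the same condition to cover that (assuming zeros may appear as the number 0 or the string "0"):

from pyspark.sql.functions import col, count, isnan, when

def is_missing(c):
    # null, NaN, numeric 0, or the string "0"
    return col(c).isNull() | isnan(c) | (col(c) == 0) | (col(c) == "0")

counted = df.select([(count(when(is_missing(c), c)) == df.count()).alias(c) for c in df.columns])
[k for k, v in counted.collect()[0].asDict().items() if v is True]  # ['col2', 'col4']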
Upvotes: 0
Reputation: 960
I recommend replacing all 0 (and "0") values with null and then running the summary("count") function to get the count of non-null values in each column. Columns with a count of 0 contain only nulls (or zeros that were replaced with null).
# create example
df = spark.createDataFrame(
    [("Jamer", None, "M", 0, "TW18"),
     ("XYZAM", None, "F", 0, "0"),
     ("Nowra", None, None, 0, None)],
    "col1 string, col2 string, col3 string, col4 int, col5 string"
)
# actual code
df.replace(0, None).replace("0", None) \
    .summary("count") \
    .show()
+-------+----+----+----+----+----+
|summary|col1|col2|col3|col4|col5|
+-------+----+----+----+----+----+
| count| 3| 0| 2| 0| 1|
+-------+----+----+----+----+----+
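Since summary() returns its counts as strings, here is a small follow-up sketch to turn that output into the actual list of column names:

counts = df.replace(0, None).replace("0", None) \
    .summary("count") \
    .collect()[0].asDict()
# keep columns whose non-null count is "0" (skipping the "summary" label column)
all_null_cols = [c for c, v in counts.items() if c != "summary" and v == "0"]
print(all_null_cols)  # ['col2', 'col4']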
Upvotes: 1