AWDn0n

Reputation: 73

How to get the numeric value of missing values in a PySpark column?

I am working with the OpenFoodFacts dataset using PySpark. There are quite a lot of columns that consist entirely of missing values, and I want to drop those columns. I have been looking up ways to retrieve the number of missing values in each column, but the result is displayed in a table format instead of giving me the actual numeric value of the total null count.

The following code shows the number of missing values in a column but displays it in a table format:

from pyspark.sql.functions import col, isnan, when, count
data.select([count(when(isnan("column") | col("column").isNull(), "column"))]).show()

I have tried the following code:

for c in data.columns:
    if(data.select([count(when(isnan(c) | col(c).isNull(), c))]) == data.count()):
        data = data.drop(c)

data.show()
for c in data.columns:
    if(data.filter(data[c].isNull()).count() == data.count()):
        data = data.drop(c)

data.show()

Is there a way to get ONLY the number? Thanks

Upvotes: 0

Views: 585

Answers (1)

Jonathan

Reputation: 2043

If you need the number itself rather than a table display, you need to use .collect():

list_of_values = data.select([count(when(isnan("column") | col("column").isNull(), "column"))]).collect()

What you get back is a list of Row objects, which contain all the values from the table.

Upvotes: 1
