ziedTn

Reputation: 262

Check for empty rows within a Spark DataFrame?

I am running over several CSV files and doing some checks, and for one file I am getting a NullPointerException, so I suspect that it contains some empty rows.

So I am running the following check, but for some reason it reports everything as fine:

from pyspark.sql import functions as sf
from pyspark.sql.types import BooleanType
# the UDF returns True only when every column in the row is None
check_empty = lambda row: not any([False if k is None else True for k in row])
check_empty_udf = sf.udf(check_empty, BooleanType())
df.filter(check_empty_udf(sf.struct([col for col in df.columns]))).show()

Am I missing something within the filter function, or is it not possible to extract empty rows from a DataFrame this way?

Upvotes: 1

Views: 4403

Answers (2)

Andrew F

Reputation: 2950

You could use df.dropna() to drop the rows that contain nulls and then compare the counts.

Something like

df_clean = df.dropna()
num_empty_rows = df.count() - df_clean.count()
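
Note that dropna() by default drops a row if any column is null; if you only want to count rows that are completely empty, a minimal sketch using the how="all" option (variable names here are just illustrative):

df_without_empty = df.dropna(how="all")              # drop only rows where every column is null
num_fully_empty_rows = df.count() - df_without_empty.count()
print(num_fully_empty_rows)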

Upvotes: 3

shriyog

Reputation: 958

You could use a built-in option of the CSV reader to handle such scenarios (shown here in Scala):

val df = spark.read
     .format("csv")
     .option("header", "true")
     .option("mode", "DROPMALFORMED") // Drop empty/malformed rows
     .load("hdfs:///path/file.csv")

Check this reference - https://docs.databricks.com/spark/latest/data-sources/read-csv.html#reading-files
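
Since the question uses PySpark, a rough equivalent of the same reader options in Python might look like this (the path is just a placeholder carried over from the Scala example):

df = (spark.read
    .format("csv")
    .option("header", "true")
    .option("mode", "DROPMALFORMED")  # drop malformed or empty lines while reading
    .load("hdfs:///path/file.csv"))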

Upvotes: 0
