Multiple nullValues in spark csv

Question

I have a CSV file that contains "" (empty value), "N/A", and "-" all in the same file. I want all of them to be read into the DataFrame as nulls. I know that spark-csv has a "nullValue" option, which lets me treat one single string as null, but that is not sufficient here because I need several different strings mapped to null.

There is an issue filed against spark-csv for this, https://github.com/databricks/spark-csv/issues/333, which is still open. What is the most elegant way to work around the problem?

combinatorist · Accepted Answer

Reposted from my comment:

1. Read the field into a DataFrame as a string.
2. Make the null replacements there.
3. Convert the field to an int.
4. Then cast that DataFrame to a Dataset.
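
The steps above could be sketched roughly as follows in Scala. This is only an illustration under stated assumptions, not the answerer's exact code: the file name `data.csv`, the column names `id` and `amount`, and the `Record` case class are all hypothetical, and the sketch assumes Spark 2.x or later with the built-in CSV reader.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, trim, when}
import org.apache.spark.sql.types.{IntegerType, StringType}

// Hypothetical record type for the final Dataset; Option[Int] holds the nullable column.
case class Record(id: String, amount: Option[Int])

object MultipleNullValues {
  // Strings that should be treated as null, taken from the question.
  val nullTokens: Seq[String] = Seq("", "N/A", "-")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("multiple-null-values")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Step 1: read without schema inference, so every column arrives as a string.
    // "data.csv" is an assumed path with a header row containing id and amount.
    val raw = spark.read.option("header", "true").csv("data.csv")

    // Step 2: replace each placeholder token with a real null.
    val cleaned = raw.withColumn(
      "amount",
      when(trim(col("amount")).isin(nullTokens: _*), lit(null).cast(StringType))
        .otherwise(col("amount"))
    )

    // Step 3: convert the cleaned string column to an int.
    val typed = cleaned.withColumn("amount", col("amount").cast(IntegerType))

    // Step 4: view the DataFrame as a typed Dataset.
    val ds = typed.as[Record]
    ds.show()

    spark.stop()
  }
}
```

If several columns need the same treatment, the `withColumn` calls in steps 2 and 3 can be folded over a list of column names instead of being written out once per column.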
