Vishnu Prathish

Reputation: 369

Multiple nullValues in spark csv

I have a CSV file that contains "" (empty value), "N/A", and "-", all in the same file. I want all of them to be read into the DataFrame as nulls. I know spark-csv has a "nullValue" option, which lets me treat a single string as null, but that is not sufficient here because the file uses three different markers.

There is a pending issue for this in spark-csv, which is still open: https://github.com/databricks/spark-csv/issues/333

I was wondering about the most elegant way to get around the problem.

Upvotes: 1

Views: 1232

Answers (2)

sudip modi

Reputation: 11

For those who can't get it to work in a Databricks Community Edition notebook: you probably haven't specified the filename.

Upvotes: 0

combinatorist

Reputation: 566

Reposted from my comment:

  • Read the field into a DataFrame as a string
  • Make the null replacements there
  • Convert the field to an int
  • Then cast that DataFrame as a Dataset

Upvotes: 3
