Marco Fedele

Reputation: 2148

How to load CSV dataset with corrupted columns?

I've exported a client database to a CSV file and tried to import it into Spark using:

spark.sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("table.csv")

After doing some validation, I found out that some ids were null because a column sometimes contains a carriage return. That shifted all the following columns, with a domino effect, corrupting all the data.

What is strange is that, when calling printSchema, the resulting table structure is correct.

How to fix the issue?

Upvotes: 2

Views: 912

Answers (2)

Dan

Reputation: 79

I'm not sure what version of Spark you are using, but beginning in 2.2 (I believe), there is a 'multiLine' option that can be used to keep together fields that have line breaks in them. From some other things I've read, you may need to apply some quoting and/or escape character options to get it working just how you want it.

spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiLine", "true")
  .csv("table.csv")

Upvotes: 0

Jacek Laskowski

Reputation: 74739

You seem to have been lucky that inferSchema worked fine (it only reads a few records to infer the schema), which is why printSchema gives you a correct result.

Since the CSV export file is broken, and assuming you want to process it with Spark (given its size, for example), read it using textFile, fix the ids, save it back out in CSV format, and load it again.
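A minimal sketch of that approach is below. It makes several assumptions that you will need to adapt to your export: the file is "table.csv", genuine records start with a numeric id followed by a comma, and a stray carriage return only pushes the tail of a record onto the following line.

import scala.collection.mutable.ArrayBuffer

// Hypothetical rule for spotting a real record: it starts with a numeric id.
def startsRecord(line: String): Boolean = line.matches("""^\d+,.*""")

val repaired = spark.sparkContext.textFile("table.csv").mapPartitions { lines =>
  val out = ArrayBuffer.empty[String]
  lines.foreach { line =>
    if (out.isEmpty || startsRecord(line)) {
      out += line                          // header or a new record
    } else {
      val last = out.length - 1
      out(last) = out(last) + " " + line   // continuation: glue it back onto the previous record
    }
  }
  out.iterator
}
// Caveat: a record split exactly at a partition boundary is not handled here;
// for a modest file, coalesce(1) before mapPartitions sidesteps that.

repaired.saveAsTextFile("table-fixed")

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("table-fixed")

df.printSchema()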

Upvotes: 3
