Marco Fedele

Reputation: 2148

How to load CSV dataset with corrupted columns?

I've exported a client database to a CSV file and tried to import it into Spark using:

spark.sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("table.csv")

After doing some validation, I found out that some ids were null because a column sometimes contains a carriage return. That shifted all the following columns, with a domino effect, corrupting all the data.

What is strange is that, when calling printSchema, the resulting table structure is correct.

How to fix the issue?

Upvotes: 2

Views: 912

Answers (2)

Dan

Reputation: 79

I'm not sure what version of Spark you are using, but beginning in 2.2 (I believe), there is a 'multiLine' option that can be used to keep together fields that have line breaks in them. From some other things I've read, you may need to apply some quoting and/or escape character options to get it working just how you want it.

spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiLine", "true")
  .csv("table.csv")

Upvotes: 0

Jacek Laskowski

Reputation: 74739

You seem to have been lucky that inferSchema worked fine (it only reads a few records to infer the schema), which is why printSchema gives you a correct result.

Since the CSV export file is broken, and assuming you want to process it with Spark (given its size, for example), read it using textFile, fix the ids, save it back out in CSV format, and load it again.
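A minimal sketch of that approach is below. It makes several assumptions that you will need to adapt to your export: the file is "table.csv", genuine records start with a numeric id followed by a comma, and a stray carriage return only pushes the tail of a record onto the following line.

import scala.collection.mutable.ArrayBuffer

// Hypothetical rule for spotting a real record: it starts with a numeric id.
def startsRecord(line: String): Boolean = line.matches("""^\d+,.*""")

val repaired = spark.sparkContext.textFile("table.csv").mapPartitions { lines =>
  val out = ArrayBuffer.empty[String]
  lines.foreach { line =>
    if (out.isEmpty || startsRecord(line)) {
      out += line                          // header or a new record
    } else {
      val last = out.length - 1
      out(last) = out(last) + " " + line   // continuation: glue it back onto the previous record
    }
  }
  out.iterator
}
// Caveat: a record split exactly at a partition boundary is not handled here;
// for a modest file, coalesce(1) before mapPartitions sidesteps that.

repaired.saveAsTextFile("table-fixed")

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("table-fixed")

df.printSchema()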

Upvotes: 3
