Reputation: 2148
I've exported a client database to a CSV file and tried to import it into Spark using:
spark.sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("table.csv")
After doing some validation, I found out that some ids were null because one column sometimes contains a carriage return. That shifted all subsequent columns, with a domino effect, corrupting all the data.
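For illustration (hypothetical data, not the real export), a record whose field contains a carriage return ends up split across two physical lines:

id,name,comment
1,Acme,all good
2,Biz,"waiting for
confirmation"
3,Corp,done

When the file is read line by line, the fragment on the second physical line is parsed as a record of its own, so its columns no longer line up with the header and every following field shifts.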
What is strange is that when I call printSchema, the resulting table structure is correct.
How can I fix this issue?
Upvotes: 2
Views: 912
Reputation: 79
I'm not sure what version of Spark you are using, but beginning in 2.2 (I believe) there is a 'multiLine' option that can be used to keep together fields that contain line breaks. From some other things I've read, you may need to apply some quoting and/or escape-character options to get it working exactly the way you want (see the sketch after the snippet below).
spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiLine", "true")
  .csv("table.csv")
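For example, if the multi-line values are wrapped in double quotes and embedded quotes are escaped with the same character, something along these lines may be needed (the quote and escape characters are assumptions about how the database exported the file, not something confirmed by the question):

spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiLine", "true")   // keep line breaks inside quoted fields
  .option("quote", "\"")         // assumed quoting character of the export
  .option("escape", "\"")        // assumed escape character of the export
  .csv("table.csv")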
Upvotes: 0
Reputation: 74739
You seem to have had a lot of luck with inferSchema: it worked fine (since it only reads a few records to infer the schema), and so printSchema gives you a correct result.
Since the CSV export file is broken, and assuming you want to process the file with Spark (given its size, for example), read it using textFile, fix the ids, save it back in CSV format, and load it again.
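A rough sketch of that approach, assuming every valid record starts with a numeric id; the id pattern, the coalesce(1) step, and the table_fixed output path are illustrative assumptions, not part of the answer:

// Read the raw export line by line, as suggested above.
val raw = spark.sparkContext.textFile("table.csv")

// Assumption: every valid record starts with a numeric id, so a line that
// does not is the tail of a record that was split by a carriage return.
val recordStart = "^\\d+,".r

val repaired = raw
  .coalesce(1)                      // keep the lines in order in one partition
  .mapPartitions { lines =>
    val out = scala.collection.mutable.ArrayBuffer.empty[String]
    lines.foreach { line =>
      if (out.isEmpty || recordStart.findFirstIn(line).isDefined)
        out += line                                    // header or a normal record
      else
        out(out.length - 1) = out.last + " " + line    // glue the broken tail back
    }
    out.iterator
  }

// Save the repaired lines and load them back through the CSV reader.
repaired.saveAsTextFile("table_fixed")

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("table_fixed")

Note that coalesce(1) gives up parallelism for the repair step so the lines stay in order; saveAsTextFile writes a directory of part files, which the CSV reader then picks up as a whole.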
Upvotes: 3