Reputation: 8503
I have CSV data in a file (data.csv) like so:
lat,lon,data
35.678243, 139.744243, "0,1,2"
35.657285, 139.749380, "1,2,3"
35.594942, 139.548870, "4,5,6"
35.705331, 139.282869, "7,8,9"
35.344667, 139.228691, "10,11,12"
Using the following spark shell command:
spark.read.option("header", true).option("escape", "\"").csv("data.csv").show(false)
I'm getting the following output:
+---------+-----------+----+
|lat |lon |data|
+---------+-----------+----+
|35.678243| 139.744243| "0 |
|35.657285| 139.749380| "1 |
|35.594942| 139.548870| "4 |
|35.705331| 139.282869| "7 |
|35.344667| 139.228691| "10|
+---------+-----------+----+
I would expect the commas within the double quotes to be ignored, in line with RFC 4180, but the parser is interpreting them as delimiters.
Using the quote option also has no effect:
scala> spark.read.option("header", true).option("quote", "\"").option("escape", "\"").csv("data.csv").show(false)
+---------+-----------+----+
|lat |lon |data|
+---------+-----------+----+
|35.678243| 139.744243| "0 |
|35.657285| 139.749380| "1 |
|35.594942| 139.548870| "4 |
|35.705331| 139.282869| "7 |
|35.344667| 139.228691| "10|
+---------+-----------+----+
Nor does omitting these options entirely:
scala> spark.read.option("header", true).csv("data.csv").show(false)
+---------+-----------+----+
|lat |lon |data|
+---------+-----------+----+
|35.678243| 139.744243| "0 |
|35.657285| 139.749380| "1 |
|35.594942| 139.548870| "4 |
|35.705331| 139.282869| "7 |
|35.344667| 139.228691| "10|
+---------+-----------+----+
Upvotes: 1
Views: 331
Reputation: 14891
Notice there is a space after the delimiter (a comma). This breaks the quote processing, because the parser only recognizes a field as quoted when the quote character comes immediately after the delimiter; here every quoted value starts with a space instead.
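On current versions, one workaround is to strip the space after each delimiter before handing the lines to the CSV parser. A minimal sketch, assuming the quoted values never themselves contain a comma followed by a space:

import spark.implicits._

// Read the raw lines, drop the space after each delimiter, then parse
// the cleaned lines as CSV. With the quote now at the start of each
// field, the default quote handling works.
val cleaned = spark.read.textFile("data.csv").map(_.replaceAll(", ", ","))
spark.read.option("header", true).csv(cleaned).show(false)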
Spark 3.0 will allow a multi-character delimiter (a comma and a space, in your case).
See https://issues.apache.org/jira/browse/SPARK-24540 for details.
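On Spark 3.0+, the read could then look like this (a sketch, assuming the multi-character string is passed via the sep option):

scala> spark.read.option("header", true).option("sep", ", ").csv("data.csv").show(false)

The quote then sits at the start of each field, so the data column should come back as the full strings 0,1,2, 1,2,3, and so on.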
Upvotes: 1