minikomi

Reputation: 8503

Reading a CSV file into spark with data containing commas in a quoted field

I have CSV data in a file (data.csv) like so:

lat,lon,data
35.678243, 139.744243, "0,1,2"
35.657285, 139.749380, "1,2,3"
35.594942, 139.548870, "4,5,6"
35.705331, 139.282869, "7,8,9"
35.344667, 139.228691, "10,11,12"

Using the following spark shell command:

spark.read.option("header", true).option("escape", "\"").csv("data.csv").show(false)

I'm getting the following output:

+---------+-----------+----+
|lat      |lon        |data|
+---------+-----------+----+
|35.678243| 139.744243| "0 |
|35.657285| 139.749380| "1 |
|35.594942| 139.548870| "4 |
|35.705331| 139.282869| "7 |
|35.344667| 139.228691| "10|
+---------+-----------+----+

I would expect the commas inside the double quotes to be ignored as delimiters, in line with RFC 4180, but the parser is treating them as field separators.

Explicitly setting the quote option also has no effect:

scala> spark.read.option("header", true).option("quote", "\"").option("escape", "\"").csv("data.csv").show(false)
+---------+-----------+----+
|lat      |lon        |data|
+---------+-----------+----+
|35.678243| 139.744243| "0 |
|35.657285| 139.749380| "1 |
|35.594942| 139.548870| "4 |
|35.705331| 139.282869| "7 |
|35.344667| 139.228691| "10|
+---------+-----------+----+

Nor does omitting the options entirely:

scala> spark.read.option("header", true).csv("data.csv").show(false)
+---------+-----------+----+
|lat      |lon        |data|
+---------+-----------+----+
|35.678243| 139.744243| "0 |
|35.657285| 139.749380| "1 |
|35.594942| 139.548870| "4 |
|35.705331| 139.282869| "7 |
|35.344667| 139.228691| "10|
+---------+-----------+----+

Upvotes: 1

Views: 331

Answers (1)

Tagar

Reputation: 14891

Notice that there is a space after each delimiter (the comma ,).

This breaks quote processing: the parser only recognizes a quoted field when the quote character immediately follows the delimiter, so the stray space makes the " be read as ordinary field content.

Spark 3.0 will allow a multi-character delimiter (a comma followed by a space, in your case).

See https://issues.apache.org/jira/browse/SPARK-24540 for details.
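The effect of that stray space can be reproduced outside Spark with Python's stdlib csv module, which follows the same RFC 4180 convention: a quote only starts a quoted field if it immediately follows the delimiter. This is a hedged illustration of the cause, not the Spark parser itself; the skipinitialspace flag shown here is Python-specific.

```python
import csv
import io

# One data row shaped like the question's data.csv: note the space after each comma.
raw = 'lat,lon,data\n35.678243, 139.744243, "0,1,2"\n'

# Default parsing: the space before the opening quote means the quote character
# is treated as ordinary field content, so the commas inside the quotes split the field.
broken = list(csv.reader(io.StringIO(raw)))[1]
print(broken)  # ['35.678243', ' 139.744243', ' "0', '1', '2"']

# skipinitialspace=True discards whitespace right after the delimiter, so the
# parser sees the quote at the start of the field and honors the quoting.
fixed = list(csv.reader(io.StringIO(raw), skipinitialspace=True))[1]
print(fixed)   # ['35.678243', '139.744243', '0,1,2']
```

In Spark the analogous fix (once 3.0's multi-character delimiter support lands) would be to declare the delimiter as the two-character string ", " rather than stripping the space after the fact.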

Upvotes: 1
