Reputation: 692
I am currently reading in CSV data using the following code:
Dataset<Row> dataset = getSparkSession().read()
.option("header", "true")
.option("quote", '"')
.option("sep", ',')
.schema(schema)
.csv(path)
.toDF();
This is pointed at a CSV file that has rows that look like this:
"abc","city","123"
as well as another file that has rows that look like this:
"abc","city",123
The second one works fine because the schema I pass is
string, string, long
but the first one results in java.lang.NumberFormatException: For input string: "123"
Is it possible for the CSV reader to correctly read CSVs in both of these valid formats, assuming the right options are passed?
I am using Spark 2.1.1
Upvotes: 1
Views: 962
Reputation: 1518
Using your code actually crashes for me; I suspect that using characters instead of Strings is the culprit. Using '"'.toString for .option("quote", ...) fixes the crash, and works. Furthermore, you may also want to define the escape character, as in the following code.
In Cloudera's Spark2, I was able to use the following to parse both quoted and unquoted numbers to DecimalType, with a pre-defined schema:
spark.read
  .option("mode", "FAILFAST")                      // throw immediately on malformed rows
  .option("escape", "\"")                          // escape character for quotes inside quoted fields
  .option("delimiter", DELIMITER)
  .option("header", HASHEADER.toString)
  .option("quote", "\"")
  .option("nullValue", null)
  .option("ignoreLeadingWhiteSpace", value = true)
  .schema(SCHEMA)
  .csv(PATH)
Examples of parsed numbers (from unit tests):
1.0
11
"15.23"
""              // empty field
"0.0000000001"
1111111111111.
 000000000.     // with leading space
This also works in my tests for IntegerType: it is parsed correctly regardless of quotes.
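For comparison, here is a minimal sketch of the same fix applied to the asker's original Java snippet (getSparkSession(), schema, and path are assumed from the question). In Java the char literals are the likely culprit too: a char argument matches a numeric overload of option rather than option(String, String), so the quote ends up set to the character's numeric code point.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Minimal sketch, assuming getSparkSession(), schema, and path from the question.
// All option values are Strings; a char literal like '"' would bind to a numeric
// option overload and set "quote" to the character's code point instead.
Dataset<Row> dataset = getSparkSession().read()
        .option("header", "true")
        .option("quote", "\"")   // String "\"" instead of char '"'
        .option("escape", "\"")  // also define the escape character, per above
        .option("sep", ",")      // String "," instead of char ','
        .schema(schema)
        .csv(path);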
Upvotes: 1
Reputation: 381
Use the inferSchema option, which automatically identifies the data types of the columns:
var data = sparkSession.read
    .option("header", hasColumnHeader)
    .option("inferSchema", "true")
    .csv(inputPath);
Upvotes: 1