rotsner

Reputation: 692

Spark CSV Reader quoted numerics

I am currently reading in CSV data using the following code:

Dataset<Row> dataset = getSparkSession().read()
            .option("header", "true")
            .option("quote", '"')
            .option("sep", ',')
            .schema(schema)
            .csv(path)
            .toDF();

It is pointed at a CSV file that has rows that look like this:

"abc","city","123"

as well as another file that has rows that look like this:

"abc","city",123

The second one works fine because the schema I pass is

string, string, long

The first one results in java.lang.NumberFormatException: For input string: "123".

Is it possible for the CSV reader to read both of these valid formats correctly, assuming the right options are passed?

I am using Spark 2.1.1

Upvotes: 1

Views: 962

Answers (2)

Rick Moritz

Reputation: 1518

Your code actually crashes for me. I suspect that passing characters instead of Strings is the culprit: using '"'.toString for .option("quote", ...) fixes the crash and works. Furthermore, you may also want to define the escape character, as in the code below.
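Applied to the Java snippet from the question, that fix would look roughly like this (a sketch, keeping getSparkSession(), schema, and path from the question):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Pass the quote and separator as Strings; the char-typed
// arguments are what caused the crash described above.
Dataset<Row> dataset = getSparkSession().read()
            .option("header", "true")
            .option("quote", "\"")
            .option("escape", "\"")  // also define the escape character
            .option("sep", ",")
            .schema(schema)
            .csv(path);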

In Cloudera's Spark2, I was able to use the following to parse both quoted and unquoted numbers to DecimalType, with a pre-defined schema:

spark.read
            .option("mode", "FAILFAST")
            .option("escape", "\"")
            .option("delimiter", DELIMITER)
            .option("header", HASHEADER.toString)
            .option("quote", "\"")
            .option("nullValue", null)
            .option("ignoreLeadingWhiteSpace", value = true)
            .schema(SCHEMA)
            .csv(PATH)

Examples of parsed numbers (from unit tests):

1.0
11
"15.23"
""
 //empty field
"0.0000000001"
1111111111111.
 000000000. //with leading space

This also works in my tests for IntegerType - it can be parsed regardless of quotes.

Upvotes: 1

Varadha31590

Reputation: 381

Use the inferSchema option, which automatically identifies the data types of the columns.

val data = sparkSession.read
      .option("header", hasColumnHeader)
      .option("inferSchema", "true")
      .csv(inputPath)

Upvotes: 1
