Reputation: 692
I am currently reading in CSV data using the following code:
Dataset<Row> dataset = getSparkSession().read()
.option("header", "true")
.option("quote", '"')
.option("sep", ',')
.schema(schema)
.csv(path)
.toDF();
This is pointed at a CSV file that has rows that look like this:
"abc","city","123"
as well as another file that has rows that look like this:
"abc","city",123
The second one works fine because the schema I pass is
string, string, long
but the first one results in java.lang.NumberFormatException: For input string: "123"
Is it possible for the CSV reader to correctly read CSVs in both of these valid formats, assuming the right options are passed?
I am using Spark 2.1.1
Upvotes: 1
Views: 962
Reputation: 1518
Using your code actually crashes for me; I suspect that using characters instead of Strings is the culprit. Using '"'.toString for .option("quote", ...) fixes the crash, and works. Furthermore, you may also want to define the escape character, as in the following code.
In Cloudera's Spark2, I was able to use the following to parse both quoted and unquoted numbers to DecimalType, with a pre-defined schema:
spark.read
  .option("mode", "FAILFAST")                      // throw immediately on malformed rows
  .option("escape", "\"")                          // escape character for quotes inside quoted fields
  .option("delimiter", DELIMITER)
  .option("header", HASHEADER.toString)
  .option("quote", "\"")
  .option("nullValue", null)
  .option("ignoreLeadingWhiteSpace", value = true)
  .schema(SCHEMA)
  .csv(PATH)
Examples of parsed numbers (from unit tests):
1.0
11
"15.23"
""              // empty field
"0.0000000001"
1111111111111.
 000000000.     // with leading space
This also works in my tests for IntegerType: it is parsed correctly regardless of quotes.
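For comparison, here is a minimal sketch of the same fix applied to the asker's original Java snippet (getSparkSession(), schema, and path are assumed from the question). In Java the char literals are the likely culprit too: a char argument matches a numeric overload of option rather than option(String, String), so the quote ends up set to the character's numeric code point.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Minimal sketch, assuming getSparkSession(), schema, and path from the question.
// All option values are Strings; a char literal like '"' would bind to a numeric
// option overload and set "quote" to the character's code point instead.
Dataset<Row> dataset = getSparkSession().read()
        .option("header", "true")
        .option("quote", "\"")   // String "\"" instead of char '"'
        .option("escape", "\"")  // also define the escape character, per above
        .option("sep", ",")      // String "," instead of char ','
        .schema(schema)
        .csv(path);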
Upvotes: 1
Reputation: 381
Use the inferSchema option, which automatically identifies the data types of the columns:
var data = sparkSession.read
    .option("header", hasColumnHeader)
    .option("inferSchema", "true")
    .csv(inputPath);
Upvotes: 1