Reputation: 771
I have a local CSV file, "test.csv", whose first row contains the column names and whose remaining rows contain data. I tried reading the CSV in Java like this:
Dataset<Row> test_table = sparkSession()
.sqlContext()
.read()
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load("test.csv");
This was suggested here:
Read csv as Data Frame in spark 1.6
But I keep getting the error:
java.lang.NegativeArraySizeException
at com.univocity.parsers.common.input.DefaultCharAppender.<init>(DefaultCharAppender.java:39)
at com.univocity.parsers.csv.CsvParserSettings.newCharAppender(CsvParserSettings.java:82)
at com.univocity.parsers.common.ParserOutput.<init>(ParserOutput.java:93)
at com.univocity.parsers.common.AbstractParser.<init>(AbstractParser.java:74)
at com.univocity.parsers.csv.CsvParser.<init>(CsvParser.java:59)
at org.apache.spark.sql.execution.datasources.csv.CsvReader.<init>(CSVParser.scala:49)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:61)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:183)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)
What's the problem and how can I read from the CSV into a dataset?
Upvotes: 1
Views: 1153
Reputation: 6289
Author of the univocity-parsers library here. This is happening because internally Spark sets the maximum value length to -1 (meaning no limit). Support for that was introduced in univocity-parsers 2.2.0. Just make sure the library version is 2.2.0 or later and you should be fine, as older versions don't support setting the maxCharsPerColumn property to -1.
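If upgrading isn't immediately an option, one possible workaround (a sketch, assuming your Spark build exposes the maxCharsPerColumn CSV option) is to pass an explicit positive limit so the old parser never receives -1:
Dataset<Row> testTable = sparkSession()
        .read()
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .option("inferSchema", "true")
        // explicit positive limit instead of the -1 older parsers reject
        .option("maxCharsPerColumn", "1000000")
        .load("test.csv");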
If you have multiple versions of that library in your classpath, get rid of the older ones. Ideally you'd want to update to the latest version (currently 2.5.4) and use only that. It should work just fine, as we make sure any changes made to the library are backward compatible.
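If you're not sure which copy wins on the classpath, a quick check (a sketch using plain JDK reflection, nothing library-specific) is to print where the parser class was loaded from:
// Prints the jar CsvParser was actually loaded from, which exposes
// a stale univocity-parsers copy lingering on the classpath.
System.out.println(com.univocity.parsers.csv.CsvParser.class
        .getProtectionDomain().getCodeSource().getLocation());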
Upvotes: 1
Reputation: 1853
This is most likely caused by the dependencies you are using. Try a different version of the spark-csv package, such as
--packages com.databricks:spark-csv_2.10:1.5.0 or com.databricks:spark-csv_2.10:1.4.0
It should work.
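For example, assuming you launch through spark-shell (spark-submit takes the same flag), the package can be pulled in at startup:
spark-shell --packages com.databricks:spark-csv_2.10:1.5.0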
Upvotes: 0