Jessica

Reputation: 771

Error reading local CSV into Spark Dataset

I have a local CSV "test.csv" where the first row contains the column names and the following rows contain data. I tried reading the CSV like this in Java:

Dataset<Row> test_table = sparkSession()
    .sqlContext()
    .read()
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("test.csv");

This was suggested here:
Read csv as Data Frame in spark 1.6

But I keep getting the error:

java.lang.NegativeArraySizeException
    at com.univocity.parsers.common.input.DefaultCharAppender.<init>(DefaultCharAppender.java:39)
    at com.univocity.parsers.csv.CsvParserSettings.newCharAppender(CsvParserSettings.java:82)
    at com.univocity.parsers.common.ParserOutput.<init>(ParserOutput.java:93)
    at com.univocity.parsers.common.AbstractParser.<init>(AbstractParser.java:74)
    at com.univocity.parsers.csv.CsvParser.<init>(CsvParser.java:59)
    at org.apache.spark.sql.execution.datasources.csv.CsvReader.<init>(CSVParser.scala:49)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:61)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
    at scala.Option.orElse(Option.scala:289)
    at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:183)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)

What's the problem and how can I read from the CSV into a dataset?

Upvotes: 1

Views: 1153

Answers (2)

Jeronimo Backes

Reputation: 6289

Author of the univocity-parsers library here. This is happening because internally Spark sets the maximum value length to -1 (meaning no limit). Support for that value was only introduced in univocity-parsers 2.2.0 onward.

Just make sure the univocity-parsers version on your classpath is 2.2.0 or later and you should be fine, as older versions don't support setting the maxCharsPerColumn property to -1.
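
To see the property in action outside of Spark, here is a minimal sketch against plain univocity-parsers (my own illustration, not taken from Spark's source); on a version older than 2.2.0 the -1 value is rejected while the internal buffer is allocated, which is exactly the NegativeArraySizeException shown above:

import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

import java.io.FileReader;
import java.util.List;

public class UnlimitedColumnLength {
    public static void main(String[] args) throws Exception {
        CsvParserSettings settings = new CsvParserSettings();
        // -1 means "no limit on characters per column"; only accepted from 2.2.0 onward
        settings.setMaxCharsPerColumn(-1);
        // treat the first row as the header, like option("header", "true") in Spark
        settings.setHeaderExtractionEnabled(true);

        CsvParser parser = new CsvParser(settings);
        List<String[]> rows = parser.parseAll(new FileReader("test.csv"));
        System.out.println("Parsed " + rows.size() + " data rows");
    }
}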

If you have multiple versions of that library on your classpath, get rid of the older ones. Ideally you'd want to update to the latest version (currently 2.5.4) and use only that. It should work just fine, as we make sure any changes made to the library are backward compatible.
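
If you are not sure which copy is actually being picked up at runtime, one quick check (my suggestion, assuming a standard jar-based classpath) is to print where the CsvParser class was loaded from:

// Prints the location of the jar that provided CsvParser, e.g.
// .../univocity-parsers-2.5.4.jar; an older version here points to the conflict.
System.out.println(
    com.univocity.parsers.csv.CsvParser.class
        .getProtectionDomain()
        .getCodeSource()
        .getLocation());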

Upvotes: 1

Nayan Sharma

Reputation: 1853

This is most likely due to the dependencies you are using. Try a different spark-csv package, for example

   --packages com.databricks:spark-csv_2.10:1.5.0

or

   --packages com.databricks:spark-csv_2.10:1.4.0

It should work.
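
As a side note, if you are actually running Spark 2.x (the CSVFileFormat entries in the stack trace suggest so), CSV support is built in and the external Databricks package is not needed at all; a minimal sketch, assuming sparkSession() returns your application's SparkSession:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Built-in Spark 2.x CSV reader; no com.databricks.spark.csv format string needed
Dataset<Row> test_table = sparkSession()
    .read()
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("test.csv");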

Upvotes: 0
