noobnoob

Reputation: 167

Getting com.univocity.parsers.common.TextParsingException while loading a csv file

I'm trying to join a TSV dataset that has a lot of newlines in its data to another dataframe, and I keep getting

com.univocity.parsers.common.TextParsingException

I've already cleaned my data to replace \N with NA, as I thought that could be the reason, but without success.

The error points me to the following record in the faulty data:

tt0100054 2 Повелитель мух SUHH ru NA NA 0

The stack trace is as follows:

    19/03/02 17:45:42 ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 10)
com.univocity.parsers.common.TextParsingException: Length of parsed input (1000001) exceeds the maximum number of characters defined in your parser settings (1000000). 
Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\n'. Parsed content:
    Sesso e il poliziotto sposato   IT  NA  NA  NA  0[\n]
    tt0097089   4   Sex and the Married Detective   US  NA  NA  NA  0[\n]`tt0100054 1   Fluenes herre   NO  NA  imdbDisplay NA  0
tt0100054   20  Kärpästen herra FI  NA  NA  NA  0
tt0100054   2
    at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:302)
    at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:431)
    at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:148)
    at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:131)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1000000
    at com.univocity.parsers.common.input.AbstractCharInputReader.appendUtilAnyEscape(AbstractCharInputReader.java:331)
    at com.univocity.parsers.csv.CsvParser.parseQuotedValue(CsvParser.java:246)
    at com.univocity.parsers.csv.CsvParser.parseRecord(CsvParser.java:119)
    at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:400)
    ... 22 more

I've already tried setting the following options on the CSV reader: .option("maxCharsPerCol","110000000").option("multiLine","true"), but it doesn't help. I'd appreciate any help fixing this.
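
For reference, this is roughly the read call I'm using (the path and variable name are placeholders):

    // Sketch of the failing read; the path is a placeholder.
    val akas = spark.read
      .option("delimiter", "\t")
      .option("maxCharsPerCol", "110000000") // tried, doesn't help
      .option("multiLine", "true")           // tried, doesn't help
      .csv("/path/to/data.tsv")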

I'm using Spark 2.0.2 and Scala 2.11.8.

Upvotes: 0

Views: 8221

Answers (3)

Andrew L

Reputation: 11

For anyone encountering this issue reading wide CSV files within Spark, see https://spark.apache.org/docs/latest/sql-data-sources-csv.html

The CSV reader in Spark has a setting maxColumns which is set to a default of 20480 (as of Spark 3.3).

You can increase this limit by setting it to a number at least as large as the expected number of columns (if known):

spark.read.format("csv").option("header", "true").option("maxColumns", 500000).load(filename)

Keep in mind that there's a tradeoff with increasing maxColumns: you're preallocating more memory, so at a certain point you'll run out of memory from preallocating too much extra space.

Upvotes: 1

Akhil

Reputation: 646

Jeronimo's answer will solve this issue.

Just adding a sample code block in case you are wondering how to do this in Spark.

val tsvData = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\t")
  .option("quote", "\u0000") // use the NUL character as the quote, effectively disabling quote handling
  .csv(csvFilePath)

Upvotes: 0

Jeronimo Backes

Reputation: 6289

Author of univocity-parsers here.

The parser was built to fail fast when something is potentially wrong with either your program (e.g. the file format was not configured correctly) or the input file (e.g. it doesn't have the format your program expects, or has unescaped/unclosed quotes).

The stack trace shows this:

Sesso e il poliziotto sposato   IT  NA  NA  NA  0[\n]
tt0097089   4   Sex and the Married Detective   US  NA  NA  NA  0[\n]`tt0100054 1   Fluenes herre   NO  NA  imdbDisplay NA  0
tt0100054   20  Kärpästen herra FI  NA  NA  NA  0
tt0100054   2

Which clearly shows the content of multiple rows being read as if they were part of a single value. This means that somewhere around this text in your input file there are values starting with a quote that is never closed.

You can configure the parser to not try to handle quoted values with this:

settings.getFormat().setQuote('\0');

If you are sure your format configuration is correct and that there are very long values in the input, set maxCharsPerColumn to -1.
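
As a rough Scala sketch of those two settings together (assuming univocity-parsers 2.x on the classpath; the file path is a placeholder):

    import java.io.FileReader
    import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

    val settings = new CsvParserSettings()
    settings.getFormat().setQuote('\u0000') // NUL quote character, i.e. no quote handling
    settings.setMaxCharsPerColumn(-1)       // no limit on the length of a single value
    val parser = new CsvParser(settings)
    val rows = parser.parseAll(new FileReader("/path/to/data.tsv"))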

Lastly, it looks like you are parsing TSV, which is not CSV and should be processed differently. If that's the case, you can also try to use the TsvParser instead.
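
A minimal TsvParser sketch under the same assumptions (placeholder path):

    import java.io.FileReader
    import com.univocity.parsers.tsv.{TsvParser, TsvParserSettings}

    val tsvSettings = new TsvParserSettings()
    tsvSettings.setMaxCharsPerColumn(-1) // again, no per-value length limit
    val tsvParser = new TsvParser(tsvSettings)
    val tsvRows = tsvParser.parseAll(new FileReader("/path/to/data.tsv"))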

Hope this helps

Upvotes: 5
