Hayat Behlim

Reputation: 386

Got `java.nio.BufferOverflowException` in pyspark dataframe count function

I'm working with the following environment: Spark 2.0.0, HDP 2.5.3.0, Python 2.7, yarn-client mode.

My PySpark code works fine most of the time.
However, I sometimes get the exception below from the df.count() call.

Code that works for me:

df = spark.read.orc("${path}")
df.count()

Code in which I get exception:

df = spark.read.orc("${path}")
df = df.cache()
df.count()

Stacktrace:

    Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 9, apii): java.nio.BufferOverflowException
        at java.nio.Buffer.nextPutIndex(Buffer.java:521)
        at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:169)
        at org.apache.spark.sql.execution.columnar.BOOLEAN$.append(ColumnType.scala:286)
        at org.apache.spark.sql.execution.columnar.compression.RunLengthEncoding$Encoder.compress(compressionSchemes.scala:143)
        at org.apache.spark.sql.execution.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:103)
        at org.apache.spark.sql.execution.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:97)

Any help is appreciated :)

Upvotes: 0

Views: 911

Answers (1)

Yaron

Reputation: 10450

When you cache() an RDD/DataFrame, the memory used for the cached data is taken from the executor's memory.

In other words, if an executor has 8GB of memory and the cached RDD occupies 3GB, only 5GB (instead of 8GB) is left for execution, which might cause the buffer overflow issue you are facing.

I suspect that increasing the RAM allocated to each executor and/or increasing the number of executors (usually together with increasing the number of partitions) will make the buffer overflow error disappear.
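
As a rough sketch of what that tuning could look like when creating the SparkSession (the values below are placeholders, not recommendations; in yarn-client mode the same settings can also be passed to spark-submit via --executor-memory and --num-executors):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("orc-count")
    # Give each executor more headroom for cached blocks (placeholder value).
    .config("spark.executor.memory", "8g")
    # Spread the work over more executors (placeholder value).
    .config("spark.executor.instances", "10")
    .getOrCreate()
)

df = spark.read.orc("${path}")
# More, smaller partitions so no single cached column block overflows its buffer.
df = df.repartition(200)
df = df.cache()
df.count()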

A short quote from the memory-management section of the Spark architecture documentation:

Execution Memory

  • storage for data needed during task execution
  • shuffle-related data

Storage Memory

  • storage of cached RDDs and broadcast variables
  • possible to borrow from execution memory (spill otherwise)
  • safeguard value is 50% of Spark Memory when cached blocks are immune to eviction
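
For reference, the behaviour described in that quote is governed by two configuration properties under Spark 2.x's unified memory manager; a minimal sketch, with the values shown set to their Spark 2.0 defaults:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Fraction of the JVM heap used for execution + storage ("Spark Memory"); default 0.6.
    .config("spark.memory.fraction", "0.6")
    # Share of Spark Memory whose cached blocks are immune to eviction; default 0.5.
    .config("spark.memory.storageFraction", "0.5")
    .getOrCreate()
)

Lowering spark.memory.storageFraction gives execution more guaranteed room, at the cost of cached blocks being evicted sooner.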

Upvotes: 1
