Hayat Behlim

Reputation: 386

Got `java.nio.BufferOverflowException` in pyspark dataframe count function

I'm working with the following environment: Spark 2.0.0, HDP 2.5.3.0, Python 2.7, yarn-client mode.

My PySpark code works fine most of the time.
However, I sometimes get the exception below from the df.count() call.

Code that works for me:

df = spark.read.orc("${path}")
df.count()

Code in which I get exception:

df = spark.read.orc("${path}")
df = df.cache()
df.count()

Stacktrace:

    Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 9, apii): java.nio.BufferOverflowException
        at java.nio.Buffer.nextPutIndex(Buffer.java:521)
        at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:169)
        at org.apache.spark.sql.execution.columnar.BOOLEAN$.append(ColumnType.scala:286)
        at org.apache.spark.sql.execution.columnar.compression.RunLengthEncoding$Encoder.compress(compressionSchemes.scala:143)
        at org.apache.spark.sql.execution.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:103)
        at org.apache.spark.sql.execution.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:97)

Any help is appreciated :)

Upvotes: 0

Views: 911

Answers (1)

Yaron

Reputation: 10450

When you cache() an RDD/DataFrame, the memory used for the cached data is taken from the executor's memory.

In other words, if an executor has 8GB of memory and the cached RDD occupies 3GB, only 5GB (instead of 8GB) is left for execution, which might cause the buffer overflow issue you are facing.

I suspect that increasing the RAM allocated to each executor and/or increasing the number of executors (usually together with increasing the number of partitions) will make the buffer overflow error disappear.
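
As a rough sketch of what that tuning could look like when creating the SparkSession (the values below are placeholders, not recommendations; in yarn-client mode the same settings can also be passed to spark-submit via --executor-memory and --num-executors):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("orc-count")
    # Give each executor more headroom for cached blocks (placeholder value).
    .config("spark.executor.memory", "8g")
    # Spread the work over more executors (placeholder value).
    .config("spark.executor.instances", "10")
    .getOrCreate()
)

df = spark.read.orc("${path}")
# More, smaller partitions so no single cached column block overflows its buffer.
df = df.repartition(200)
df = df.cache()
df.count()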

A short quote from the memory-management section of the Spark architecture documentation:

Execution Memory

  • storage for data needed during task execution
  • shuffle-related data

Storage Memory

  • storage of cached RDDs and broadcast variables
  • possible to borrow from execution memory (spill otherwise)
  • safeguard value is 50% of Spark Memory when cached blocks are immune to eviction
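
For reference, the behaviour described in that quote is governed by two configuration properties under Spark 2.x's unified memory manager; a minimal sketch, with the values shown set to their Spark 2.0 defaults:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Fraction of the JVM heap used for execution + storage ("Spark Memory"); default 0.6.
    .config("spark.memory.fraction", "0.6")
    # Share of Spark Memory whose cached blocks are immune to eviction; default 0.5.
    .config("spark.memory.storageFraction", "0.5")
    .getOrCreate()
)

Lowering spark.memory.storageFraction gives execution more guaranteed room, at the cost of cached blocks being evicted sooner.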

Upvotes: 1
