Reputation: 386
I'm working with the following environment: Spark 2.0.0, HDP 2.5.3.0, Python 2.7, yarn-client mode.
My PySpark code works fine most of the time. However, I sometimes get the exception below when calling df.count().
Code that works for me:
df = spark.read.orc("${path}")
df.count()
Code in which I get exception:
df = spark.read.orc("${path}")
df = df.cache()
df.count()
Stacktrace:
Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 9, apii): java.nio.BufferOverflowException
at java.nio.Buffer.nextPutIndex(Buffer.java:521)
at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:169)
at org.apache.spark.sql.execution.columnar.BOOLEAN$.append(ColumnType.scala:286)
at org.apache.spark.sql.execution.columnar.compression.RunLengthEncoding$Encoder.compress(compressionSchemes.scala:143)
at org.apache.spark.sql.execution.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:103)
at org.apache.spark.sql.execution.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:97)
Any help is appreciated :)
Upvotes: 0
Views: 911
Reputation: 10450
When you call cache()
on an RDD/DataFrame, the memory that holds the cached data is taken from the executor memory.
For example, if an executor has 8 GB of memory and the cached data takes 3 GB, only 5 GB (instead of 8 GB) remains for execution, which might cause the buffer overflow
issue you are facing.
I'd guess that increasing the RAM allocated to each executor and/or increasing the number of executors (usually together with increasing the number of partitions) might make the buffer overflow
error disappear; see the sketch below.
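For example, a minimal sketch of what that could look like when building the session (the values and the app name are illustrative only, and in yarn-client mode the executor settings must be set before the SparkContext is created):

from pyspark.sql import SparkSession

# Illustrative values only: give each executor more memory, ask YARN for more
# executors, and use more partitions so each cached block stays small.
spark = (
    SparkSession.builder
    .appName("orc-cache-example")                   # hypothetical app name
    .config("spark.executor.memory", "8g")          # more RAM per executor
    .config("spark.executor.instances", "4")        # more executors on YARN
    .config("spark.sql.shuffle.partitions", "200")  # more shuffle partitions
    .getOrCreate()
)

df = spark.read.orc("${path}")
df = df.repartition(200)  # more, smaller partitions per cached block
df = df.cache()
df.count()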
In short, following the memory management section in spark-architecture, the executor's unified memory pool is split into two regions:
Execution Memory: used for computation in shuffles, joins, sorts and aggregations.
Storage Memory: used for caching and propagating internal data (cache()/persist()) across the cluster.
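If you want to tune how that pool is divided, these are the standard Spark 2.x properties (the values below are simply the defaults, shown for illustration):

from pyspark.sql import SparkSession

# Spark 2.x unified memory management knobs (defaults shown, for illustration).
spark = (
    SparkSession.builder
    # Fraction of the JVM heap shared by execution and storage (default 0.6).
    .config("spark.memory.fraction", "0.6")
    # Part of that region protected for cached blocks; execution cannot evict
    # storage below this threshold (default 0.5).
    .config("spark.memory.storageFraction", "0.5")
    .getOrCreate()
)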
Upvotes: 1