Reputation: 1375
I was using Spark 2.1.1 and upgraded to the latest version, 2.4.4. I observed from the Spark UI that the driver memory increases continuously, and after a long run I got the following error: java.lang.OutOfMemoryError: GC overhead limit exceeded
In Spark 2.1.1 the driver memory consumption (Storage Memory tab) was extremely low, and after the ContextCleaner and BlockManager ran, the memory decreased.
Also, I tested Spark versions 2.3.3 and 2.4.3 and observed the same behavior.
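For reference, the ContextCleaner mentioned above is driven by a periodic GC on the driver whose interval is configurable through spark.cleaner.periodicGC.interval (default 30min). A minimal sketch of setting it when building the session; the 15min value is only illustrative, not a recommendation:

from pyspark.sql import SparkSession

# Ask the driver to run a GC more often so the ContextCleaner can process
# out-of-scope RDD/broadcast references sooner (default interval is 30min).
spark = (SparkSession.builder
         .appName("PeriodicGCExample")
         .config("spark.cleaner.periodicGC.interval", "15min")
         .getOrCreate())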
HOW TO REPRODUCE THIS BEHAVIOR:
Create a very simple application (streaming count_file.py) to reproduce this behavior. The application reads CSV files from a directory, counts the rows, and then removes the processed files.
import os

from pyspark.sql import SparkSession

target_dir = "..."

spark = SparkSession.builder.appName("DataframeCount").getOrCreate()

while True:
    for f in os.listdir(target_dir):
        path = os.path.join(target_dir, f)
        # Read the CSV file, count its rows, then delete the processed file.
        df = spark.read.load(path, format="csv")
        print("Number of records: {0}".format(df.count()))
        os.remove(path)
        print("File {0} removed successfully!".format(f))
Submit command:
spark-submit \
  --master spark://xxx.xxx.xx.xxx \
  --deploy-mode client \
  --executor-memory 4g \
  --executor-cores 3 \
  --queue streaming \
  count_file.py
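Since the error occurs on the driver, the driver heap can be raised with spark-submit's --driver-memory flag (the 6g below is just an illustrative value); with the growth described here this only postpones the OutOfMemoryError rather than preventing it:

spark-submit \
  --master spark://xxx.xxx.xx.xxx \
  --deploy-mode client \
  --driver-memory 6g \
  --executor-memory 4g \
  --executor-cores 3 \
  --queue streaming \
  count_file.py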
TESTED CASES WITH THE SAME BEHAVIOR:
DEPENDENCIES
Upvotes: 5
Views: 8498
Reputation: 1375
In the end, the memory increase shown in the Spark UI was caused by a bug in Spark versions newer than 2.3.3. There is a fix, and it will be included in Spark 2.4.5 and later.
Spark related issues:
Spark UI storage memory increasing overtime: https://issues.apache.org/jira/browse/SPARK-29055
Possible memory leak in Spark: https://issues.apache.org/jira/browse/SPARK-29321
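As a quick check of whether the fix applies to a given deployment, the running Spark version can be printed from PySpark; a minimal sketch (the app name is arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("VersionCheck").getOrCreate()
# spark.version returns the version string of the running build, e.g. "2.4.5";
# the fix mentioned above ships in 2.4.5 and later releases.
print("Running Spark version: {0}".format(spark.version))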
Upvotes: 3