ggeop

Reputation: 1375

Memory leak in Spark Driver

I was using Spark 2.1.1 and upgraded to the latest version, 2.4.4. I observed in the Spark UI that the driver memory increases continuously, and after a long run I got the following error: java.lang.OutOfMemoryError: GC overhead limit exceeded

In Spark 2.1.1 the driver memory consumption (Storage Memory tab) was extremely low, and after the ContextCleaner and BlockManager ran, the memory decreased.
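For reference, one way to confirm that the ContextCleaner/BlockManager cleanup is actually running is to raise the driver's log level; a minimal sketch (DEBUG is assumed to be verbose enough to surface the cleaner messages, and it is noisy, so use it only while debugging):

# Show ContextCleaner / BlockManager cleanup activity in the driver log.
# DEBUG verbosity is an assumption; it is very chatty in production.
spark.sparkContext.setLogLevel("DEBUG")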

I also tested Spark versions 2.3.3 and 2.4.3 and saw the same behavior.

HOW TO REPRODUCE THIS BEHAVIOR:

I created a very simple application (count_file.py) to reproduce this behavior. It reads CSV files from a directory, counts the rows, and then removes the processed files.

import os

from pyspark.sql import SparkSession

target_dir = "..."

spark = SparkSession.builder.appName("DataframeCount").getOrCreate()

while True:
    for f in os.listdir(target_dir):
        # os.listdir returns bare file names, so rebuild the full path
        # before reading and deleting.
        path = os.path.join(target_dir, f)

        df = spark.read.load(path, format="csv")
        print("Number of records: {0}".format(df.count()))

        os.remove(path)
        print("File {0} removed successfully!".format(f))

Submit code:

spark-submit \
--master spark://xxx.xxx.xx.xxx \
--deploy-mode client \
--executor-memory 4g \
--executor-cores 3 \
--queue streaming count_file.py
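For reference, the driver's storage memory growth can also be polled outside the UI through Spark's REST API; a minimal sketch (the default UI port 4040 and the requests package are assumptions here):

import time
import requests

UI = "http://localhost:4040"  # default driver UI address; adjust for your host

# The first application in the list is the one served by this UI.
app_id = requests.get(UI + "/api/v1/applications").json()[0]["id"]

while True:
    executors = requests.get(UI + "/api/v1/applications/" + app_id + "/executors").json()
    driver = next(e for e in executors if e["id"] == "driver")
    print("Driver storage memory: {0} / {1} bytes".format(
        driver["memoryUsed"], driver["maxMemory"]))
    time.sleep(60)  # sample once a minute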


Upvotes: 5

Views: 8498

Answers (1)

ggeop

Reputation: 1375

In the end, the memory increase shown in the Spark UI turned out to be a bug in Spark versions newer than 2.3.3. A fix exists and will ship in Spark 2.4.5+.
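For versions before the fix, a workaround that is sometimes suggested (my own sketch, not part of the official fix) is to shorten the driver's periodic GC interval so the ContextCleaner can reclaim out-of-scope blocks sooner:

from pyspark.sql import SparkSession

# Workaround sketch for pre-2.4.5 versions: lower the driver's periodic GC
# interval (default 30min) so ContextCleaner cleans up out-of-scope
# RDD/broadcast/shuffle blocks sooner. The 5min value is an arbitrary example.
spark = (SparkSession.builder
         .appName("DataframeCount")
         .config("spark.cleaner.periodicGC.interval", "5min")
         .getOrCreate())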

Spark related issues:

Upvotes: 3
