Rajneesh Kumar

Reputation: 167

"Size in Memory" under storage tab of spark UI showing increase in RAM usage over time for spark streaming

I am using Spark Streaming in my application. Data arrives as streaming files every 15 minutes. I have allocated 10G of RAM to the Spark executors, and with this setting my application runs fine. However, looking at the Spark UI, under Storage tab -> Size in Memory, the RAM usage keeps increasing over time.

[Screenshot: Storage tab showing "Size in Memory" growth]

When I started the streaming job, the "Size in Memory" usage was in KB. It has now been 2 weeks, 2 days and 22 hours since I started the job, and the usage has grown to 858.4 MB. I have also noticed one more thing, under the Streaming heading:

[Screenshot: Streaming tab showing Processing Time and Total Delay]

When I started the job, the Processing Time and Total Delay (from the image) were about 5 seconds; after 16 days they have increased to 19-23 seconds, while the streaming file size is almost the same. Before increasing the executor memory to 10G, the Spark job kept failing roughly every 5 days (with the default executor memory of 1GB). Since increasing the executor memory to 10G, it has been running continuously for more than 16 days.

I am worried about memory issues. If the "Size in Memory" value keeps increasing like this, then sooner or later I will run out of RAM and the Spark job will fail again, even with 10G of executor memory. What can I do to avoid this? Do I need to change some configuration?

Just to give some context on my Spark application, I have enabled the following properties in the Spark context:

SparkConf sparkConf = new SparkConf()
        .setMaster(sparkMaster)
        .set("spark.streaming.receiver.writeAheadLog.enable", "true")
        .set("spark.streaming.minRememberDuration", "1440");

I have also enabled checkpointing as follows:

sc.checkpoint(hadoop_directory);
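For context, here is a minimal sketch of how these two snippets might fit together end to end. The 15-minute batch interval and the HDFS checkpoint directory come from the description above; sparkMaster, hadoopDirectory, and inputDirectory are placeholder names, textFileStream is only an assumed stand-in for the real source, and the actual processing logic is omitted:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf sparkConf = new SparkConf()
        .setMaster(sparkMaster)
        .set("spark.streaming.receiver.writeAheadLog.enable", "true")
        .set("spark.streaming.minRememberDuration", "1440");

// Batch interval matches the 15-minute file arrival described above.
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.minutes(15));

// Checkpoint directory on HDFS, as in the snippet above.
jssc.checkpoint(hadoopDirectory);

// Placeholder file-based source; the real job reads the incoming streaming files.
JavaDStream<String> lines = jssc.textFileStream(inputDirectory);
lines.print();

jssc.start();
jssc.awaitTermination();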

One more thing I want to highlight: I had an issue while enabling checkpointing. Regarding that checkpointing issue, I have already posted a question at the following link: Spark checkpoining error when joining static dataset with DStream

I was not able to set up checkpointing the way I originally wanted, so I did it differently (as shown above) and it is working fine now. I am not asking the checkpointing question again; I mention it only so you can judge whether the current memory issue is somehow related to that earlier one.

Environment details: Spark 1.4.1 on a two-node cluster of CentOS 7 machines, with Hadoop 2.7.1.

Upvotes: 4

Views: 2800

Answers (1)

David Schwartz

Reputation: 182827

I am worried about memory issues. If the "Size in Memory" value keeps increasing like this, then sooner or later I will run out of RAM and the Spark job will fail again, even with 10G of executor memory.

No, that's not how RAM works. Running out is perfectly normal, and when you run out, you take RAM that you are using for less important purposes and use it for more important purposes.

For example, if your system has free RAM, it can try to keep everything it has written to disk in RAM as well. Who knows, somebody might try to read that data again, and having it in RAM will save an I/O operation. Since unused RAM is simply wasted (it's not like you can use 1GB less today in order to use 1GB more tomorrow; any RAM not used right now is I/O-saving potential that is lost forever), the system might as well use it for anything that might help. But that doesn't mean it can't evict those things from RAM when it needs the memory for some other purpose.

It is not at all unusual on a typical modern system for almost all of its RAM to be in use and, at the same time, for almost all of it to also be available for other purposes.
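As an illustration in Spark terms (assuming the growing "Size in Memory" figure comes from cached RDD blocks, which I haven't verified against your job): blocks persisted at a memory-only storage level are themselves candidates for eviction when the executor needs that storage space, so the number on the Storage tab is not permanently pinned RAM. A small, self-contained sketch:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setMaster("local[2]").setAppName("cache-eviction-demo"));

// Blocks cached with MEMORY_ONLY can be dropped again (least-recently-used
// first) when the executor needs the storage space, so cached data shows up
// under "Size in Memory" without permanently pinning that RAM.
JavaRDD<String> demo = jsc.parallelize(Arrays.asList("a", "b", "c"));
demo.persist(StorageLevel.MEMORY_ONLY());
demo.count();  // materializes the cached blocks

// MEMORY_AND_DISK would spill evicted blocks to disk instead of recomputing them.
// demo.persist(StorageLevel.MEMORY_AND_DISK());

jsc.stop();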

Upvotes: 1
