AkhilaV

Reputation: 483

Why does persisting data slow down a series of tasks in Scala/Spark?

This is my sample code:

val file = sc.textFile(fileapath).persist(StorageLevel.MEMORY_AND_DISK_2)

I am running a series of tasks: reading a file from HDFS, counting its records, then doing some preprocessing (such as binning, a join, or a groupBy), then counting the result, and finally saving the file back to HDFS.
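The full flow looks roughly like this (the paths, the lookup RDD, and the join key are simplified placeholders, not my actual job):

```scala
import org.apache.spark.storage.StorageLevel

// sc is an existing SparkContext; filepath/outputPath are placeholder HDFS paths
val file = sc.textFile(filepath).persist(StorageLevel.MEMORY_AND_DISK_2)

// Action 1: count the input records
val inputCount = file.count()

// Preprocessing step: e.g. a join with a lookup RDD keyed on the first field
// (lookup: RDD[(String, String)] is a placeholder)
val keyed = file.map(line => (line.split(",")(0), line))
val joined = keyed.join(lookup)

// Action 2: count the result, then save it back to HDFS
val resultCount = joined.count()
joined.saveAsTextFile(outputPath)
```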

I noticed that when the preprocessing step is a join, the job sometimes gets stuck, and this seems to be due to the persistence of the data.

If I remove the persist, it runs fine. So now I have some doubts about data persistence: why is persist only effective for some tasks?

Please help me figure this out.

Upvotes: 0

Views: 647

Answers (1)

Yaron

Reputation: 10450

A lot of important data is missing in your question:

  • What is the size of the input data?
  • What is the RAM size of your executors?
  • Why did you choose StorageLevel.MEMORY_AND_DISK? and why MEMORY_AND_DISK_2?
  • What is the actual code you are using in order to perform the transformation/actions?

From the data you provided:

It is possible that removing the "_2" will help:

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. - Same as the levels above, but replicate each partition on two cluster nodes.

Do you really need to replicate the data?

val file = sc.textFile(fileapath).persist(StorageLevel.MEMORY_AND_DISK)

It is possible that using MEMORY_ONLY will help:

val file = sc.textFile(fileapath).persist(StorageLevel.MEMORY_ONLY)

It is possible that using MEMORY_ONLY_SER will be better:

val file = sc.textFile(fileapath).persist(StorageLevel.MEMORY_ONLY_SER)

MEMORY_ONLY_SER - Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
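Since MEMORY_ONLY_SER benefits most from a fast serializer, it can be worth registering Kryo as well. A minimal sketch (the app name and configuration are illustrative, not from your job):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Kryo is generally faster and more compact than the default Java
// serialization, which matters most for serialized storage levels
// such as MEMORY_ONLY_SER.
val conf = new SparkConf()
  .setAppName("persist-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)
val file = sc.textFile(fileapath).persist(StorageLevel.MEMORY_ONLY_SER)
```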

More info can be found here: http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence

Upvotes: 4
