Reputation: 483
This is my sample code:
val file = sc.textFile(fileapath).persist(StorageLevel.MEMORY_AND_DISK_2)
I am running a series of tasks: reading a file from HDFS and counting its records, then applying a preprocessing step such as binning, join, or groupBy, then counting the result, and finally saving the output back to HDFS.
I noticed that when the preprocessing step is a join, the job sometimes gets stuck because of this persistence of the data.
If I remove the persist, it runs fine. Now I have some doubts about persisting data: why is persist a problem only for some tasks?
Please help me figure this out.
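For context, the pipeline described above could be sketched roughly like this (the file paths, the comma-separated format, the join key, and the lookup dataset are all assumptions for illustration, not details from the original post):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object Pipeline {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pipeline"))

    // persist() is lazy: the RDD is only materialized (and cached)
    // when the first action, such as count(), actually runs.
    val file = sc.textFile("hdfs:///input/data.txt")
      .persist(StorageLevel.MEMORY_AND_DISK_2)

    println(s"input records: ${file.count()}")

    // Example preprocessing step: key the records and join with
    // a second dataset (hypothetical lookup file and key column).
    val keyed  = file.map(line => (line.split(",")(0), line))
    val lookup = sc.textFile("hdfs:///input/lookup.txt")
      .map(line => (line.split(",")(0), line))
    val joined = keyed.join(lookup)

    println(s"joined records: ${joined.count()}")
    joined.saveAsTextFile("hdfs:///output/result")

    // Release the cached blocks once they are no longer needed.
    file.unpersist()
    sc.stop()
  }
}
```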
Upvotes: 0
Views: 647
Reputation: 10450
A lot of important information is missing from your question (cluster size, executor memory, data volume, error logs). From the data you did provide:
It is possible that removing the "_2" will help. Per the Spark docs, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. are the same as the corresponding base levels, but replicate each partition on two cluster nodes. Do you actually need to replicate the data?
val file = sc.textFile(fileapath).persist(StorageLevel.MEMORY_AND_DISK)
It is possible that using MEMORY_ONLY will help:
val file = sc.textFile(fileapath).persist(StorageLevel.MEMORY_ONLY)
It is possible that using MEMORY_ONLY_SER will be better:
val file = sc.textFile(fileapath).persist(StorageLevel.MEMORY_ONLY_SER)
MEMORY_ONLY_SER - Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
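Serialized storage levels work best with a fast serializer. A minimal sketch of pairing MEMORY_ONLY_SER with Kryo (the app name and input path are placeholders; note MEMORY_ONLY_SER applies to the Scala/Java API, since PySpark always stores serialized data):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("serialized-cache")
  // Kryo is typically faster and more compact than Java serialization.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

// Stores one byte array per partition instead of deserialized objects:
// smaller memory footprint, at extra CPU cost on each read.
val file = sc.textFile("hdfs:///input/data.txt")
  .persist(StorageLevel.MEMORY_ONLY_SER)
```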
more info can be found here: http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
Upvotes: 4