I am using `persist()` with different storage levels, but I found no difference in performance between `MEMORY_ONLY` and `DISK_ONLY`.
I think there might be something wrong with my code... Where can I find the persisted RDDs on disk so that I can make sure they were actually persisted?
Upvotes: 9
Views: 9351
Two possible reasons for your observation:

1. You did not trigger an action (e.g. `count()`) on the RDD after you called `persist()`. `persist()` is lazy: nothing is cached or written until an action forces the RDD to be computed. So, did the persist happen at all?
2. Even when a disk write does happen, the data may not actually reach the disk: the write returns as soon as the data is in the OS buffer cache, so the next read simply returns the cached pages. Did you clear the Linux buffer cache on each node after persisting the RDD as `DISK_ONLY`, before operating on it and measuring performance?
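The laziness in the first point is easy to check; here is a minimal PySpark sketch (the local master, app name, and sample data are illustrative assumptions, not from the question):

```python
from pyspark import SparkContext, StorageLevel

# Illustrative local context; on a real cluster you would reuse the existing one.
sc = SparkContext("local[1]", "persist-check")

rdd = sc.parallelize(range(1000)).persist(StorageLevel.DISK_ONLY)

# At this point nothing has been written to disk: persist() only marks the RDD.
print(rdd.getStorageLevel())

# An action forces the computation; only now are partitions spilled to disk.
rdd.count()
```

Until `count()` (or some other action) runs, timing the two storage levels measures nothing, because neither has materialized anything.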
So what I suggest you do is: trigger an action right after `persist()`, and clear the buffer cache of all the worker nodes during this period: `sync && echo 3 > /proc/sys/vm/drop_caches`
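Note that the cache has to be dropped on every worker node, not just the driver. A rough shell sketch (the `workers.txt` host list, passwordless ssh, and sudo access are assumptions):

```shell
# Drop the OS page cache on each worker between the write and the read,
# so a DISK_ONLY measurement actually hits the disk instead of cached pages.
while read -r host; do
  ssh "$host" 'sync && echo 3 | sudo tee /proc/sys/vm/drop_caches'
done < workers.txt
```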
Upvotes: 2
As per the doc:

> `spark.local.dir` (by default `/tmp`): Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overridden by `SPARK_LOCAL_DIRS` (Standalone, Mesos) or `LOCAL_DIRS` (YARN) environment variables set by the cluster manager.
Upvotes: 8