Haoliang

Reputation: 1097

Where does Spark actually persist RDDs on disk?

I am using persist() with different storage levels, but I see no difference in performance between MEMORY_ONLY and DISK_ONLY.

I think there might be something wrong with my code... Where can I find the persisted RDDs on disk so that I can make sure they were actually persisted?
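
For context, this is roughly what I am doing (a minimal sketch; the input path and app name are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf().setAppName("persist-test").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val rdd = sc.textFile("hdfs:///data/input")  // placeholder path

    // Swapping between these two made no measurable difference for me:
    rdd.persist(StorageLevel.MEMORY_ONLY)
    // rdd.persist(StorageLevel.DISK_ONLY)

    rdd.count()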

Upvotes: 9

Views: 9351

Answers (2)

yjshen

Reputation: 6693

Two possible reasons for your observation:

  • RDDs are persisted lazily, so to make it work you should call an action (e.g. count()) on the RDD after you call persist() (see the sketch after this list).
  • Even if the persist() did happen, the data may never actually reach the disk: the write call returns as soon as the data lands in the OS buffer cache, so when you read it back shortly afterwards, you are simply served the cached data from memory.
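
A minimal sketch of the first point, assuming spark-shell (so sc is already in scope) and a placeholder input path:

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("hdfs:///data/input")  // placeholder path

    // persist() only marks the RDD for caching; nothing is written yet.
    rdd.persist(StorageLevel.DISK_ONLY)

    // The first action actually computes the partitions and, as a side
    // effect, writes them to disk.
    rdd.count()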

So: did the persist actually happen? And did you clear the Linux buffer cache on each node after persisting the RDD as DISK_ONLY, before operating on it and measuring performance?

So what I suggest you do is the following (the whole sequence is sketched in code after the list):

  1. Persist the RDD as DISK_ONLY and invoke an action (e.g. count()) to force the persist to actually happen.
  2. Sleep the application for a few seconds, and clear the buffer cache on every worker node during that window:
    sync && echo 3 > /proc/sys/vm/drop_caches
  3. Resume your procedure and measure the performance of the persisted RDD.
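
Put together, a minimal sketch of that sequence (again assuming spark-shell and a placeholder path):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("hdfs:///data/input").persist(StorageLevel.DISK_ONLY)

    rdd.count()  // 1. an action forces the persist to actually happen

    // 2. pause, and during this window run on every worker node:
    //      sync && echo 3 > /proc/sys/vm/drop_caches
    Thread.sleep(30000)

    // 3. this count can no longer be served from the OS buffer cache
    val start = System.nanoTime()
    rdd.count()
    println(s"count from disk: ${(System.nanoTime() - start) / 1e6} ms")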

Upvotes: 2

Francois G

Reputation: 11985

As per the doc:

spark.local.dir (by default /tmp)

Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.
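
So if you want a directory you can actually inspect, set spark.local.dir yourself before creating the context. A minimal sketch, with a hypothetical scratch directory; on the versions I have looked at, persisted partitions appear there as block files named like rdd_<rddId>_<partitionIndex> while the application is running:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("find-persisted-blocks")
      .setMaster("local[*]")
      .set("spark.local.dir", "/data/spark-scratch")  // hypothetical directory

    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 1000000).persist(StorageLevel.DISK_ONLY)
    rdd.count()  // materialize so the partitions are written out

    // Now look under /data/spark-scratch on each worker while the app
    // is still running; Spark cleans the scratch space up on shutdown.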

Upvotes: 8
