Reputation: 99
Just wondering whether Spark utilizes HDFS Centralized Caching — I can't seem to find this asked anywhere.
e.g.
hiveContext.sql("SELECT * FROM A_TABLE")
Would this utilize the cached blocks?
Upvotes: 2
Views: 674
Reputation: 26
It does use HDFS cached blocks, but it is currently not optimized for them. For example, a block might be cached on nodeA while the task is scheduled on nodeB. If the block is local to nodeB, it will be read from nodeB's disk; if the block is not local, HDFS will make sure to read it from nodeA, where it is cached. I have a JIRA task open to optimize this, although it is not yet merged into Spark trunk: https://issues.apache.org/jira/browse/SPARK-19705
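For context, centralized caching is configured on the HDFS side, not in Spark: you create a cache pool and add a cache directive for the table's directory, and the NameNode then instructs DataNodes to pin those blocks in off-heap memory. A minimal sketch using the standard `hdfs cacheadmin` commands (the pool name and path here are made-up examples for a Hive-backed table):

```
# Create a cache pool to hold cache directives (name is an example)
hdfs cacheadmin -addPool hive_hot_pool

# Cache the warehouse directory backing A_TABLE (path is an example)
hdfs cacheadmin -addDirective -path /user/hive/warehouse/a_table -pool hive_hot_pool

# Verify which directives exist and how many bytes/files are cached
hdfs cacheadmin -listDirectives -stats
```

Once the blocks are cached, any HDFS reader (including Spark executors) benefits automatically when it happens to read from a caching DataNode; as noted above, Spark's scheduler does not yet prefer those nodes.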
Upvotes: 1