Reputation: 27373
I have a question regarding the usage of DataFram APIs cache
. Consider the following query:
val dfA = spark.table(tablename)
.cache
val dfC = dfA
.join(dfA.groupBy($"day").count,Seq("day"),"left")
So dfA
is used twice in this query, so I thought caching it would be benefical. But I'm confused about the plan, the table is still scanned twice (FileScan
appearing twice):
dfC.explain
== Physical Plan ==
*Project [day#8232, i#8233, count#8251L]
+- SortMergeJoin [day#8232], [day#8255], LeftOuter
:- *Sort [day#8232 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(day#8232, 200)
: +- InMemoryTableScan [day#8232, i#8233]
: +- InMemoryRelation [day#8232, i#8233], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
: +- *FileScan parquet mytable[day#8232,i#8233] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://tablelocation], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<day:int,i:int>
+- *Sort [day#8255 ASC NULLS FIRST], false, 0
+- *HashAggregate(keys=[day#8255], functions=[count(1)])
+- Exchange hashpartitioning(day#8255, 200)
+- *HashAggregate(keys=[day#8255], functions=[partial_count(1)])
+- InMemoryTableScan [day#8255]
+- InMemoryRelation [day#8255, i#8256], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *FileScan parquet mytable[day#8232,i#8233] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://tablelocation], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<day:int,i:int>
Why isn't the table cached? Im using Spark 2.1.1
Upvotes: 4
Views: 427
Reputation: 503
Try with count() after cache so you trigger one action and the caching is done before the plan of the second one is "calculated".
As far as I know, the first action will trigger the cache, but since Spark planning is not dynamic, if your first action after cache uses the table twice, it will have to read it twice (because it won't cache the table until it executes that action).
If the above doesn't work [and/or you are hitting the bug mentioned], it's probably related to the plan, you can also try transforming the DF to RDD and then back to RDD (this way the plan will be 100% exact).
Upvotes: 1