Raphael Roth

Reputation: 27373

Cache not preventing multiple filescans?

I have a question regarding the usage of the DataFrame API's cache. Consider the following query:

val dfA = spark.table(tablename)
.cache

val dfC = dfA
.join(dfA.groupBy($"day").count,Seq("day"),"left")

Since dfA is used twice in this query, I thought caching it would be beneficial. But I'm confused by the plan: the table is still scanned twice (FileScan appears twice):

dfC.explain

== Physical Plan ==
*Project [day#8232, i#8233, count#8251L]
+- SortMergeJoin [day#8232], [day#8255], LeftOuter
   :- *Sort [day#8232 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(day#8232, 200)
   :     +- InMemoryTableScan [day#8232, i#8233]
   :           +- InMemoryRelation [day#8232, i#8233], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
   :                 +- *FileScan parquet mytable[day#8232,i#8233] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://tablelocation], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<day:int,i:int>
   +- *Sort [day#8255 ASC NULLS FIRST], false, 0
      +- *HashAggregate(keys=[day#8255], functions=[count(1)])
         +- Exchange hashpartitioning(day#8255, 200)
            +- *HashAggregate(keys=[day#8255], functions=[partial_count(1)])
               +- InMemoryTableScan [day#8255]
                     +- InMemoryRelation [day#8255, i#8256], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
                           +- *FileScan parquet mytable[day#8232,i#8233] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://tablelocation], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<day:int,i:int>

Why isn't the table cached? I'm using Spark 2.1.1.

Upvotes: 4

Views: 427

Answers (1)

BiS

Reputation: 503

Try calling count() after cache, so that you trigger one action and the caching is done before the plan of the second action is computed.

As far as I know, the first action triggers the cache, but since Spark planning is not dynamic, if your first action after cache uses the table twice, it will have to read it twice (because the table isn't cached until that action executes).
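A minimal sketch of the suggestion. The local session and the small inline dataset are illustrative assumptions standing in for spark.table(tablename); whether the printed plan changes depends on when the cache is materialized relative to planning:

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session, for illustration only.
val spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()
import spark.implicits._

// Stand-in for spark.table(tablename): a tiny (day, i) dataset.
val dfA = Seq((1, 10), (1, 11), (2, 20)).toDF("day", "i").cache

// Trigger an action now, so the cache is populated before the join
// below runs; at execution time both InMemoryTableScans can then be
// served from the already-materialized cache instead of the source.
dfA.count()

val dfC = dfA.join(dfA.groupBy($"day").count, Seq("day"), "left")
dfC.explain()
```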

If the above doesn't work [and/or you are hitting the bug mentioned], it's probably related to the plan. You can also try converting the DataFrame to an RDD and then back to a DataFrame (this way the plan is guaranteed to be exact).
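A sketch of the RDD round-trip idea (the local session and sample data are assumptions): rebuilding the DataFrame from its RDD drops the logical lineage back to the file scan, so the optimizer cannot re-derive the data from the source:

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session, for illustration only.
val spark = SparkSession.builder.master("local[*]").appName("lineage-demo").getOrCreate()
import spark.implicits._

// Stand-in for spark.table(tablename).
val dfA = Seq((1, 10), (1, 11), (2, 20)).toDF("day", "i").cache
dfA.count()  // materialize the cache first

// Round-trip through an RDD: the rebuilt DataFrame carries no lineage
// to the original FileScan, so the join plan starts from this data.
val dfAStable = spark.createDataFrame(dfA.rdd, dfA.schema)

val dfC = dfAStable.join(dfAStable.groupBy($"day").count, Seq("day"), "left")
```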

Upvotes: 1
