Oleg

Reputation: 191

Is caching necessary for a dataframe that is reused before the first action?

I have a dataframe which I transform independently in several different ways before joining the results into a final DF. The intermediate transformed dataframes are never used in any "actions"; the first action is called only after all the parts are joined together. My question is: should I cache the first dataframe? Example:

arpu_df=get_arpu_df(..)  # would .cache() help here?
sample_by_arpu_ranges=arpu_df.filter("arpu>50").sample(False,0.4)\
    .union(
        arpu_df.filter("arpu>20 and arpu<=50").sample(False,0.1)
        )\
    .union(
        arpu_df.filter("arpu<=20").sample(False,0.02)
        ).select("base_subsc_id")
sample_by_arpu_ranges.count()

sample is a transformation, as far as I know. I wonder whether the arpu_df part will be recomputed to apply each of the filters, or whether the logical plan builder will understand that it can reuse it in the different parts of the plan?
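For reference, one way to check this (a sketch continuing the snippet above) is to print the query plan and see how many times the source behind arpu_df is scanned:

# Sketch (continuing the snippet above): explain(True) prints the parsed,
# analyzed, optimized and physical plans; if the scan behind arpu_df appears
# once per filter branch, each branch recomputes it from the source.
sample_by_arpu_ranges.explain(True)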

Upvotes: 1

Views: 2404

Answers (2)

abiratsis

Reputation: 7316

The cache is only populated after an action is called, so in your case the answer is no: the cache will not have any benefit before sample_by_arpu_ranges.count() is called. A common workaround is to call a relatively cheap action, count(), right after cache(); your code would then look like this:

arpu_df=get_arpu_df(..)

arpu_df.cache()
arpu_df.count()

sample_by_arpu_ranges=arpu_df.filter("arpu>50").sample(False,0.4)\
    .union(
        arpu_df.filter("arpu>20 and arpu<=50").sample(False,0.1)
        )\
    .union(
        arpu_df.filter("arpu<=20").sample(False,0.02)
        ).select("base_subsc_id")
sample_by_arpu_ranges.count()
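As a side note, you can check the caching state on the DataFrame itself; a small sketch (the names follow the example above):

arpu_df.is_cached     # True as soon as cache()/persist() has been marked on the dataframe
arpu_df.storageLevel  # DataFrame.cache() typically defaults to MEMORY_AND_DISK

# The data is only materialized once an action (here count()) runs; the
# Storage tab of the Spark UI shows what is actually held in memory/disk.

arpu_df.unpersist()   # release the cached data once it is no longer needed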

Upvotes: 1

SimbaPK

Reputation: 596

The answer is in your question: you have only one action, so all of your transformations will be executed at that point. In that case you don't need to persist (or cache) your dataframe.

Persist is useful only if you would otherwise need to recompute the transformations.

Example:

arpu_df=get_arpu_df(..)
sample_by_arpu_ranges=arpu_df.filter("arpu>50").sample(False,0.4)\
    .union(
        arpu_df.filter("arpu>20 and arpu<=50").sample(False,0.1)
        )\
    .union(
        arpu_df.filter("arpu<=20").sample(False,0.02)
        ).select("base_subsc_id").persist()  # persist sample_by_arpu_ranges here because you know you will run multiple actions on it

sample_by_arpu_ranges.count()  # 1st action

sample_by_arpu_ranges.write.parquet("path")  # 2nd action

In the example, sample_by_arpu_ranges will be persisted during the 1st action, so for the 2nd action it will already be available in the cache.

-> Without persist, with one action:

arpu_df = spark.read.parquet(path) 
sample_by_arpu_ranges=arpu_df.filter(...)
sample_by_arpu_ranges.count()

What's happening:

  • sample_by_arpu_ranges.count()
  • arpu_df = spark.read.parquet(path)
  • sample_by_arpu_ranges=arpu_df.filter(...)
  • count

--> neither arpu_df nor sample_by_arpu_ranges is kept, but you don't need them anymore

-> Without persist, with multiple actions:

arpu_df = spark.read.parquet(path) 
sample_by_arpu_ranges=arpu_df.filter(...)
arpu_df.count()
sample_by_arpu_ranges.count()

What's happening:

  • arpu_df.count()
  • arpu_df = spark.read.parquet(path)
  • count

--> does not keep arpu_df!

  • sample_by_arpu_ranges.count()
  • arpu_df = spark.read.parquet(path)  # you have to read it again!
  • sample_by_arpu_ranges=arpu_df.filter(...)
  • count

-> With persist, with multiple actions:

arpu_df = spark.read.parquet(path).persist()
sample_by_arpu_ranges=arpu_df.filter(...)
arpu_df.count()
sample_by_arpu_ranges.count()

What's happening:

  • arpu_df.count()
  • arpu_df = spark.read.parquet(path)
  • persist ---> saves arpu_df in the cache
  • count
  • sample_by_arpu_ranges.count()
  • sample_by_arpu_ranges=arpu_df.filter(...)  # arpu_df is taken from the cache, no need to read it again
  • count
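Putting the last scenario together as a runnable sketch (the parquet path and filter condition are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

arpu_df = spark.read.parquet("path/to/arpu").persist()   # placeholder path
sample_by_arpu_ranges = arpu_df.filter("arpu > 50")      # placeholder condition

arpu_df.count()                # 1st action: reads the parquet and fills the cache
sample_by_arpu_ranges.count()  # 2nd action: reuses the cached arpu_df

arpu_df.unpersist()            # free the cached blocks when done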

Upvotes: 0
