J. Doe

Reputation: 11

Do you need to cache a DataFrame if you want to count it and later process it?

I have an example case:

val df = ... // read from HDFS / file / ...
println(df.count)
val newDf = df.select(...) // other transformations; keep processing the original df

My question is: do I need to cache the original DataFrame if I count it in the middle of my process? I mean, I count the df and then keep transforming and processing it.
Does .count mean the df will be computed twice?

Upvotes: 0

Views: 3174

Answers (1)

zero323

Reputation: 330193

Without knowing more about your use case and resources, it is hard to give a definitive answer. However, it is most likely no, even though Spark will access the source twice.

Overall, there are multiple factors to consider:

  • How much data is loaded in the first pass. With an efficient on-disk input format (like Parquet), Spark has no need to fully load the dataset at all. The same applies to a number of other input sources, including but not limited to the JDBC reader.
  • The cost of caching data with the Dataset API is quite high (that's why the default storage mode is MEMORY_AND_DISK) and can easily exceed the cost of loading the data.
  • Impact on the subsequent processing. In general, caching will interfere with partition pruning, predicate pushdown and projections (see Any performance issues forcing eager evaluation using count in spark?).
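For the second point, the trade-off can be made explicit by persisting with the storage level spelled out. A minimal sketch, assuming a toy DataFrame with an `id` column stands in for the real source (the `CacheSketch` wrapper and column name are illustrative, not from the question):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  // Persist once so that the count and the downstream work share one source read.
  // The storage level shown is the Dataset API default mentioned above.
  def countThenProcess(df: DataFrame): (Long, Long) = {
    df.persist(StorageLevel.MEMORY_AND_DISK)
    val total = df.count()              // first action reads the source and fills the cache
    val evens = df.filter("id % 2 = 0") // served from the cache, not the source
    val kept = evens.count()
    df.unpersist()
    (total, kept)
  }
}
```

Whether this wins depends on whether reading the source twice is actually more expensive than materializing (and possibly spilling) the cache.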

So...

Does .count mean the df will be computed twice?

To some extent, depending on the input format. Plain-text formats (like JSON or CSV) will require more repeated work than binary sources.

do I need to cache the original DataFrame

Typically not, unless you know that the cost of fetching the data from storage justifies the disadvantages listed above.

Making the final decision requires a good understanding of your pipeline (primarily how the data will be processed downstream and how caching affects this) and of the metrics you want to optimize (latency, total execution time, monetary cost of the required resources).

You should also consider alternatives to count, like processing InputMetrics or using Accumulators.
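To illustrate the accumulator route: a LongAccumulator can piggyback the row count onto processing you are doing anyway, avoiding a separate counting job. A sketch under the assumption that a `map` over a toy range stands in for the real per-row work (the `AccumulatorCountSketch` wrapper and names are illustrative):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object AccumulatorCountSketch {
  // Count rows as a side effect of the real processing instead of a separate df.count.
  def processAndCount(spark: SparkSession): (Array[Long], Long) = {
    import spark.implicits._
    val rowCount = spark.sparkContext.longAccumulator("rowCount")
    val processed: Dataset[Long] = spark.range(0, 100).map { x =>
      rowCount.add(1) // incremented once per row as it flows through
      x * 2           // placeholder for the actual transformation
    }
    val result = processed.collect() // the action you would run anyway
    (result, rowCount.value)
  }
}
```

One caveat worth knowing: accumulators updated inside transformations can over-count when tasks are retried, so treat the value as approximate unless it is updated in an action.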

Upvotes: 5
