Reputation: 11
I have an example case:
val df = ... // read from HDFS / file / ...
println(df.count)
val newDf = df.select // Other transformations... Keep processing the original df.
My question is: do I need to cache the original dataframe if I count it in the middle of my process? I mean, I count the df and then keep transforming and processing it.
Does the .count mean the df will be computed twice?
Upvotes: 0
Views: 3174
Reputation: 330193
Without knowing more about your use case and resources it is hard to give you a definitive answer. However the answer is most likely no, despite the fact that Spark will access the source twice.
Overall there are multiple factors to consider:
- The cost of reading the data from the source again (Spark will access it once for the count and once for the downstream processing).
- The cost of caching itself: the memory footprint of data cached through the Dataset API is quite high (that's why the default storage level is MEMORY_AND_DISK) and can easily exceed the cost of loading the data.
So...
Does the .count mean the df will be computed twice?
To some extent, depending on the input format. Plain text formats (like JSON or CSV) will require more repeated work than binary sources.
do I need to cache the original dataframe
Typically not, unless you know that the cost of fetching the data from storage justifies the disadvantages listed above.
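If you do conclude that re-reading the source is the more expensive option, a minimal sketch of explicit caching could look like this (the input path, column name, and output path are placeholders, not part of your question):
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cache-before-count").getOrCreate()

// Hypothetical source; substitute your real input.
val df = spark.read.json("/path/to/input")

// Persist before the first action so the count and the later
// transformations reuse the materialized data instead of hitting the source twice.
df.persist(StorageLevel.MEMORY_AND_DISK)

println(df.count())

val newDf = df.select("someColumn")   // keep processing the original df
newDf.write.parquet("/path/to/output")

// Free the cached blocks once they are no longer needed.
df.unpersist()
Note that persist is lazy; the data is actually materialized by the first action (the count here), and unpersist releases the storage once the pipeline is done with it.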
Making the final decision requires a good understanding of your pipeline (primarily how the data will be processed downstream and how caching affects this) and of the metrics you want to optimize (latency, total execution time, monetary cost of running the required resources).
You should also consider alternatives to count, like processing InputMetrics or using Accumulators.
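A minimal sketch of the accumulator approach (the text source and paths are assumptions made for illustration): the counting happens as a side effect of the action that drives the rest of the pipeline, so no separate pass over the data is needed.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("count-via-accumulator").getOrCreate()
import spark.implicits._

val df = spark.read.textFile("/path/to/input")   // Dataset[String], hypothetical input

// Long accumulator incremented as rows flow through the pipeline.
val rowCount = spark.sparkContext.longAccumulator("rowCount")

val processed = df.map { line =>
  rowCount.add(1)
  line.toUpperCase   // stand-in for the real transformations
}

processed.write.text("/path/to/output")   // one action drives both the write and the count

// Read the accumulator only after the action has finished.
println(s"Rows processed: ${rowCount.value}")
Keep in mind that accumulator updates made inside transformations can be applied more than once if tasks are retried, so treat the result as an approximation rather than an exact count.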
Upvotes: 5