Reputation: 331
When I write the intermediate DataFrame to CSV, read it back as a DataFrame, and perform operations on it, it is faster than caching the intermediate DataFrame (group_df in the flow below) and performing the same operations on it.
Please see the example:
1. input_df(dataframe) => 20 million records
2. group_df(dataframe) => 27k records
input_df => group_df => perform operations
I am trying the options below, and the 3rd looks to be the fastest. Can you please explain this behavior?
1. group_df.cache()
2. group_df.persist(StorageLevel.DISK_ONLY)
3. write group_df to CSV and read it back as a DataFrame
Upvotes: 1
Views: 161
Reputation: 711
Of course! group_df.cache() is lazy: nothing is computed until the first action runs on group_df, and most often only fragments of the data end up stored in memory (many fragments are evicted in LRU fashion). In the latter case (writing to CSV and reading it back), all of the computation has already finished and the results have been written out, so the operations on group_df only need to read the data from disk.
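One way to check this on your own job is to time the first and second action after cache() separately from the CSV round trip. The following is only a rough sketch, assuming PySpark and a local session. The synthetic data, the column names key and n, the temporary output path, and the helpers timed and run_comparison are all made up for illustration and are not taken from the question.

```python
import os
import tempfile
import time


def timed(action):
    """Call a zero-argument function and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


def run_comparison():
    """Time cache() against a CSV round trip on synthetic data (illustrative only)."""
    # Imported here so that timed() stays usable without Spark installed.
    from pyspark.sql import SparkSession, functions as F

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("cache-vs-csv")
        .getOrCreate()
    )

    # Hypothetical stand-in for the question's data:
    # 20 million input rows aggregated down to 27k groups.
    input_df = spark.range(20_000_000).withColumn("key", F.col("id") % 27_000)
    group_df = input_df.groupBy("key").agg(F.count("*").alias("n"))

    timings = {}

    # cache() only marks the DataFrame for caching. The first action runs
    # the whole aggregation and fills the cache; later actions can reuse it.
    group_df.cache()
    _, timings["cache_first_action"] = timed(group_df.count)
    _, timings["cache_second_action"] = timed(group_df.count)
    group_df.unpersist(blocking=True)

    # CSV round trip: the write is itself an action, so the aggregation
    # cost is paid there. Reading back only scans the small result files.
    out_dir = os.path.join(tempfile.mkdtemp(), "group_df_csv")
    _, timings["csv_write"] = timed(
        lambda: group_df.write.mode("overwrite").csv(out_dir, header=True)
    )
    reloaded_df = spark.read.csv(out_dir, header=True, inferSchema=True)
    _, timings["csv_first_action"] = timed(reloaded_df.count)

    spark.stop()
    return timings
```

Calling run_comparison() returns a dict of elapsed seconds per step. If the explanation above applies to your job, the first action after cache() should carry the cost of the aggregation, while the first action on the reloaded CSV should not, because that cost was already paid during the write. Actual numbers will depend on your cluster and data.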
Upvotes: 1