Naresh Krishnamoorthy

Reputation: 331

Writing and reading back an intermediate DataFrame works better than cache(). Is this expected behaviour?

When I write the intermediate DataFrame to CSV, read it back as a DataFrame, and perform operations on it, it is faster than caching the intermediate DataFrame (group_df in the flow below) and performing the same operations on it.

Please see the example:

1. input_df(dataframe) => 20 million records
2. group_df(dataframe) => 27k records

input_df => group_df => perform operations

I am trying the options below, and the 3rd looks to be the fastest. Can you please explain this behaviour?

1. group_df.cache()
2. group_df.persist(StorageLevel.DISK_ONLY)
3. Write group_df to CSV and read it back as a DataFrame

Upvotes: 1

Views: 161

Answers (1)

Sai

Reputation: 711

Yes, this is expected. group_df.cache() is lazy: it does not compute anything until an action runs, and most often only fragments of the data end up stored in memory (and many fragments are removed in LRU fashion). With option 3, all computation is already finished and the results have been written out, so later operations on group_df only need to read from disk.
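To make the "lazy" part concrete, here is a small toy model in plain Python. It is not Spark's API: `LazyFrame`, `expensive_group`, and the `compute_calls` counter are hypothetical names made up for illustration, and the model ignores partitions and eviction entirely. It only sketches the idea that calling cache() sets a flag, while the actual work happens on the first action.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (NOT Spark's API)."""

    def __init__(self, compute):
        self._compute = compute        # function that rebuilds the rows from scratch
        self._cache_requested = False  # set by cache(), nothing is computed yet
        self._cached_rows = None       # filled in by the first action after cache()
        self.compute_calls = 0         # how many times the full computation ran

    def cache(self):
        # Lazy: only records the request and returns immediately.
        self._cache_requested = True
        return self

    def _rows(self):
        if self._cached_rows is not None:
            return self._cached_rows   # served from the cache, no recomputation
        self.compute_calls += 1
        rows = self._compute()
        if self._cache_requested:
            self._cached_rows = rows   # materialized as a side effect of the action
        return rows

    def count(self):
        # An "action": forces evaluation.
        return len(self._rows())


def expensive_group():
    # Hypothetical stand-in for the input_df => group_df step.
    counts = {}
    for i in range(100_000):
        key = i % 27
        counts[key] = counts.get(key, 0) + 1
    return list(counts.items())


# Without cache(): every action re-runs the whole computation.
uncached = LazyFrame(expensive_group)
uncached.count()
uncached.count()

# With cache(): nothing runs at cache() time; the first action pays the cost.
cached = LazyFrame(expensive_group).cache()
calls_after_cache = cached.compute_calls
cached.count()
cached.count()
```

In this sketch the uncached frame runs the grouping twice, while the cached one runs it once, on the first count(). If the same holds in your job, the first operation after group_df.cache() includes the full cost of building group_df, whereas reading back a CSV starts from data that is already computed.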

Upvotes: 1
