Jessica Smith

Reputation: 151

Does an Apache Spark DataFrame load data from the DB for every operation, or does it reuse the same data unless told otherwise?

We have a use case where we need to search for specific records that fulfil certain conditions. There are several such conditions, and we need to identify the matching records for each of them. We plan to use Apache Spark DataFrames. Does Spark load the table data from the DB for every search we run, or does it load and distribute the table data across the Spark cluster nodes once and then evaluate each search condition against that data until it is explicitly told to reload from the DB?

Upvotes: 0

Views: 111

Answers (1)

Ewan Leith

Reputation: 1665

If you create the DataFrame with a .cache() or .persist() call, Spark will attempt to keep the DataFrame in memory.

If you don't call .cache(), it will read the data from the source dataset on demand.
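For illustration, here is a minimal sketch in Scala, assuming a JDBC source table called `records` with placeholder connection details and column names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("search-example").getOrCreate()

// The JDBC connection details below are placeholders for illustration.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "records")
  .option("user", "spark")
  .option("password", "secret")
  .load()

df.cache()   // mark the DataFrame for in-memory persistence
df.count()   // the first action materialises the cached partitions

// Later searches reuse the cached data on the executors instead of
// re-reading the table from the database.
val openCount  = df.filter(df("status") === "OPEN").count()
val largeCount = df.filter(df("amount") > 1000).count()
```

Without the .cache() call, each of those counts would go back to the source when its action runs.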

If there's not enough memory available to hold the full data set in cache, then Spark will recalculate some blocks on the fly.
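If you would rather have Spark spill those blocks to local disk than recompute them, you can choose an explicit storage level; continuing the sketch above:

```scala
import org.apache.spark.storage.StorageLevel

// Keep partitions in memory where possible and spill the rest to local
// disk, rather than recomputing them from the source on each access.
df.persist(StorageLevel.MEMORY_AND_DISK)
```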

If your source dataset is constantly changing, then you probably want to create a fairly static export dataset first.
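One way to do that, again just a sketch with a placeholder path, is to write a one-off snapshot to Parquet and run the searches against that fixed copy:

```scala
// Take a snapshot of the source table so later searches run against a
// stable copy; the output path is illustrative only.
df.write.mode("overwrite").parquet("/data/records_snapshot")

val snapshot = spark.read.parquet("/data/records_snapshot")
snapshot.cache()
```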

Have a look at the Spark RDD persist documentation (it's the same for DataFrames) to get a better understanding of what you can do.

Upvotes: 2
