Reputation: 151
We have a use case where we need to search for records that fulfill certain conditions. There are multiple such conditions, and we need to identify the matching records for each of them. We plan to use Apache Spark DataFrames. Do Spark DataFrames load the table data from the database for every search we execute, or do they load and distribute the table data among the Spark cluster nodes once and then run the search conditions against that copy until explicitly told to reload the data from the database?
Upvotes: 0
Views: 111
Reputation: 1665
If you call .cache() (or .persist()) on the DataFrame, Spark will attempt to keep it in memory once the first action has materialized it.
If you don't cache it, Spark will read the data in from the source dataset on demand, i.e. for every query you run; see the sketch below.
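A minimal PySpark sketch of the cached pattern. The JDBC URL, table name (records), and columns (status, amount) are placeholders for your own source:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cached-search").getOrCreate()

# Placeholder JDBC source; swap in your own connection details.
df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", "records")
      .option("user", "user")
      .option("password", "password")
      .load())

df.cache()  # lazy: marks the DataFrame for in-memory persistence

# The first action triggers the DB read and populates the cache ...
open_count = df.filter(df.status == "OPEN").count()

# ... and later searches run against the cached partitions,
# not the database.
large_count = df.filter(df.amount > 1000).count()
```

Without the df.cache() line, each count() above would go back to the source.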
If there isn't enough memory available to hold the full dataset in cache, Spark will recalculate the evicted blocks on the fly.
If your source dataset is constantly changing, then you probably want to create a fairly static export of it first and run your searches against that, as sketched below.
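Continuing the sketch above, one way to do that is to snapshot the live table to Parquet once and search the immutable copy (the path is a placeholder):

```python
# One-off snapshot of the live table; searches then run against
# this static copy instead of the changing source.
df.write.mode("overwrite").parquet("/tmp/records_snapshot")

snapshot = spark.read.parquet("/tmp/records_snapshot").cache()
open_count = snapshot.filter(snapshot.status == "OPEN").count()
```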
Have a look at the Spark RDD persistence documentation (the same mechanism applies to DataFrames) to get a better understanding of what you can do.
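For example, persist() takes an explicit storage level, so instead of having blocks recomputed you can let Spark spill them to local disk; a sketch using the DataFrame from above:

```python
from pyspark.storagelevel import StorageLevel

# MEMORY_AND_DISK spills partitions that don't fit in memory to
# local disk rather than recomputing them on the next query.
df.persist(StorageLevel.MEMORY_AND_DISK)

# Release the cached data once the searches are finished.
df.unpersist()
```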
Upvotes: 2