Reputation: 25
If I don't cache a dataframe that is generated by a Spark SQL query with a limit clause, will I get unstable results whenever I transform the resulting dataframe and show it?
Description.
I have a table like the one below, generated by a Spark SQL query with a limit clause:
+---------+---+---+---+---+
|partition|   |  0|  1|  2|
+---------+---+---+---+---+
|        0|  0|  0| 10| 18|
|        1|  0|  0| 10| 17|
|        2|  0|  0| 13| 17|
+---------+---+---+---+---+
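For context, a minimal sketch of how such a dataframe might be produced (the table name my_table and the limit size are hypothetical; the original query is not shown in the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# LIMIT without an ORDER BY gives no guarantee about *which* rows
# are returned, so each execution may pull a different subset.
df = spark.sql("SELECT * FROM my_table LIMIT 100")
df.show()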
If I then add columns for the row sum and the per-column percentages and call show() again, the dataframe contains different values, as below:
+---------+---+---+---+---+-------+-----------+------------+------------------+------------------+
|partition|   |  0|  1|  2|row_sum|percent of |percent of 0|      percent of 1|      percent of 2|
+---------+---+---+---+---+-------+-----------+------------+------------------+------------------+
|        0|  0|  0| 10| 13|     23|        0.0|         0.0| 43.47826086956522| 56.52173913043478|
|        1|  0|  0| 13| 16|     29|        0.0|         0.0|44.827586206896555|55.172413793103445|
|        2|  0|  0| 15| 14|     29|        0.0|         0.0|51.724137931034484|48.275862068965516|
+---------+---+---+---+---+-------+-----------+------------+------------------+------------------+
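A sketch of the kind of edit involved, assuming the PySpark API and value columns named "0", "1", and "2" as printed in the table above:

from pyspark.sql import functions as F

value_cols = ["0", "1", "2"]  # assumed column names, as shown above
df2 = df.withColumn("row_sum", sum(F.col(c) for c in value_cols))
for c in value_cols:
    df2 = df2.withColumn(f"percent of {c}", F.col(c) / F.col("row_sum") * 100)
df2.show()  # triggers a fresh execution of the whole plan, limit included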
I suspect that transforming the original dataframe causes the very first Spark SQL query to be re-executed, so the new columns are computed from a fresh (and different) set of rows.
Is this true?
Upvotes: 1
Views: 605
Reputation: 3008
cache()
in Spark is lazily evaluated: nothing is actually persisted until you call an action on that dataframe.
If your query fetches only 10 records using limit, then calling an action such as show()
materializes the plan and pulls 10 records at that moment. If you have not cached the dataframe and you apply further transformations and then run another action on the newly created dataframe, Spark re-executes everything from the root of the lineage graph, including the original query with its limit. Since limit without an ordering does not guarantee which rows are returned, each re-execution can fetch a different set of rows, and that is why you see different output every time if you don't cache that dataframe.
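A sketch of the fix in PySpark (the query and column names are assumptions carried over from above):

df = spark.sql("SELECT * FROM my_table LIMIT 100").cache()
df.count()  # run one action so the limited rows are actually materialized

df_with_sum = df.withColumn("row_sum", F.col("0") + F.col("1") + F.col("2"))
df_with_sum.show()  # computed from the cached rows, so repeated shows agree

Note that cache() is best-effort: cached partitions can be evicted under memory pressure and recomputed from the lineage, so writing the limited result out to storage (or checkpointing) is the stricter way to pin down a sample.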
Upvotes: 2