conetfun

Reputation: 1615

Is there an extra overhead to cache Spark dataframe in memory?

I am new to Spark and wanted to understand if there is an extra overhead/delay to persist and un-persist a dataframe in memory.

From what I know so far, there is no data movement when we cache a dataframe; it is just saved in the executors' memory. So it should be just a matter of setting/unsetting a flag.

I am caching a dataframe in a spark streaming job and wanted to know if this could lead to additional delay in batch execution.

Upvotes: 1

Views: 833

Answers (1)

Jacek Laskowski

Reputation: 74619

if there is an extra overhead/delay to persist and un-persist a dataframe in memory.

It depends. If you only mark a DataFrame to be persisted, nothing really happens, since persistence is a lazy operation. You have to execute an action to trigger the DataFrame's persistence/caching, and that action does add extra overhead.

Moreover, think of persistence (caching) as a way to precompute data and save it closer to the executors (memory, disk, or a combination of the two). Moving the data from where it lives to the executors does add extra overhead at execution time (even if it's just a tiny bit).

Internally, Spark manages data as blocks (using BlockManagers on executors). The BlockManagers are peers that exchange blocks on demand (using a torrent-like protocol).

Unpersisting a DataFrame simply sends a request (sync or async) to the BlockManagers to remove the RDD blocks. If it happens asynchronously, the overhead is negligible (aside from the extra work executors have to do while running tasks).

So it should be just a matter of setting/unsetting a flag.

In a sense, that's how it is under the covers. Since a DataFrame or an RDD is just an abstraction describing a distributed computation, and does nothing at creation time, persist / unpersist really is just setting / unsetting a flag.

The change can be noticed at execution time.

I am caching a dataframe in a spark streaming job and wanted to know if this could lead to additional delay in batch execution.

If you use async unpersisting (the default), the delay should be minimal.

Upvotes: 1
