Putt

Reputation: 299

Spark Broadcast results in increased size of dataframe

I have a dataframe with a single integer column and 1B rows, so ideally its size should be 1B * 4 bytes ~= 4GB. This checks out when I cache the dataframe and look at its size: it is around 4GB.

Now, if I try to broadcast the same dataframe to join it with another dataframe, I get an error: Caused by: org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8GB: 14 GB
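
A minimal reproduction sketch of the setup (assuming Spark 3.x with Scala; spark.range stands in for the real data, and the column and variable names are illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{broadcast, col}

    val spark = SparkSession.builder().appName("broadcast-size").getOrCreate()

    // ~1B rows of a single 4-byte integer column (stand-in for the original data)
    val big = spark.range(1000000000L).select(col("id").cast("int").as("value"))
    big.cache().count()   // materialize; cached size shows up in the Storage tab (~4 GB)

    // A second dataframe to join against (illustrative)
    val other = spark.range(1000000L).select(col("id").cast("int").as("value"))

    // Forcing a broadcast of the large side is what hits the 8 GB broadcast limit
    val joined = other.join(broadcast(big), "value")
    joined.count()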

Why does the size of a broadcast dataframe increase? I have seen this in other cases as well, where a 300MB dataframe shows up as a 3GB broadcast dataframe in the Spark UI SQL tab.

Any reasoning or help is appreciated.

Upvotes: 1

Views: 1492

Answers (2)

Ged

Reputation: 18043

This appears to be a bug rather than expected behaviour, according to this issue: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-37321

Upvotes: 1

GoodManProg

Reputation: 31

The size increases in memory when a dataframe is broadcast across your cluster. How much it increases depends on how many workers you have, because Spark copies the dataframe to every worker so that subsequent operations can use it locally.

Do not broadcast big dataframes; broadcast only small ones for use in join operations.

As per the Spark documentation:

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
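
A minimal sketch of this advice (assuming Spark 3.x with Scala; the paths, the join column "key", and the tables are illustrative): broadcast only the small side of the join, or leave the decision to the optimizer, whose broadcast threshold is deliberately small by default.

    import org.apache.spark.sql.functions.broadcast

    // Large fact table stays partitioned across the cluster (paths are illustrative)
    val facts = spark.read.parquet("/data/facts")
    // Small lookup table is the only side worth broadcasting
    val dims = spark.read.parquet("/data/dims")

    // Explicit hint: ship only the small table to every executor
    val joined = facts.join(broadcast(dims), Seq("key"), "left")

    // Or let the optimizer decide; the default threshold is 10 MB
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)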

Upvotes: 3
