user2691592

Reputation: 11

Typo in spark.sql.autoBroadcastJoinThreshold ...?

I think I may have found a typo in Spark version 3.1.1. I am using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.11)

scala> spark.conf.get("spark.sql.autoBroadcastJoinThreshold")

res0: String = 10485760b

But it should possibly be: 104857600.

Therefore:

scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 104857600)

When you deploy with "10485760b", Spark cannot detect that one of the joined DataFrames is small (10 MB by default), and the automatic broadcast join detection may effectively be disabled; see the sketch below. I hope this helps someone.
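To check this yourself, here is a minimal spark-shell sketch (the DataFrames are made-up toy data, just for illustration):

// Build one small and one large DataFrame and inspect the join plan.
val small = spark.range(100).toDF("id")
val large = spark.range(1000000).toDF("id")

// With the default 10 MB threshold, the physical plan should contain
// a BroadcastHashJoin for the small side.
large.join(small, "id").explain()

// Setting the threshold to -1 disables automatic broadcast joins,
// so the same query typically falls back to a SortMergeJoin.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
large.join(small, "id").explain()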

Upvotes: 1

Views: 1909

Answers (1)

Michael Heil

Reputation: 18495

This is not a typo but the correct value.

According to the documentation on Spark configuration, spark.sql.autoBroadcastJoinThreshold has a default value of 10MB and is defined as:

"Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join."

Your proposed value of 104857600 would result in 104857600 / 1024 / 1024 = 100MB, which could potentially harm your application's performance.
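You can verify how Spark parses these size strings with its bundled JavaUtils helper (an internal class, so not a stable API, but available on the spark-shell classpath); a minimal sketch:

import org.apache.spark.network.util.JavaUtils

// "10485760b" parses to exactly 10485760 bytes, i.e. the documented 10MB default.
JavaUtils.byteStringAsBytes("10485760b")                 // 10485760
JavaUtils.byteStringAsBytes("10485760b") / 1024 / 1024   // 10  (MB)

// Your proposed value, by contrast, is ten times larger:
JavaUtils.byteStringAsBytes("104857600") / 1024 / 1024   // 100 (MB)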

In addition, the beginning of the documentation explains what the "b" stands for:

[Screenshot of the Spark configuration documentation's size-unit section, showing that a trailing "b" denotes bytes]
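So the trailing "b" is simply the bytes unit. If I read the size-string syntax correctly, these forms should all set the same 10MB threshold:

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760b")  // explicit bytes suffix
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10mb")       // mebibytes suffix
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760)     // plain number = bytes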

Upvotes: 2
