York Huang

Reputation: 23

Spark Dataset Error: Both sides of this join are outside the broadcasting threshold and computing it could be prohibitively expensive

I am using Spark 2.0.2 in local mode. I have a join between two datasets.

It is quite fast when using Spark SQL or the DataFrame API (untyped Dataset[Row]). But when I use the typed Dataset API, I get the error below.

Exception in thread "main" org.apache.spark.sql.AnalysisException: Both sides of this join are outside the broadcasting threshold and computing it could be prohibitively expensive. To explicitly enable it, please set spark.sql.crossJoin.enabled = true;
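For context, the join itself is along these lines (the dataset and key names are simplified placeholders, not my real schema):

// Fast: untyped DataFrame API
df1.join(df2, df1("key1") === df2("key2"), "left_outer")

// Slow / fails with the AnalysisException above: typed Dataset API
ds1.joinWith(ds2, ds1("key1") === ds2("key2"), "left_outer")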

I increased "spark.sql.autoBroadcastJoinThreshold", but I still get the same error. I then set "spark.sql.crossJoin.enabled" to "true"; the join works, but takes a very long time to complete.
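For reference, this is roughly how I applied the two settings (spark is the SparkSession; the 100 MB threshold is just an example value):

// autoBroadcastJoinThreshold is in bytes; -1 disables broadcast joins.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)
// Lets the planner fall back to a cartesian product (slow on large inputs).
spark.conf.set("spark.sql.crossJoin.enabled", "true")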

I didn't do any repartitioning. The sources are two Parquet files.

Any idea?

Upvotes: 2

Views: 1505

Answers (2)

York Huang

Reputation: 23

I found the reason. My ds1 also contains a field "key2", which has the same name as the join key of ds2. After renaming ds2("key2") to ds2("key3"), the join below is fast now.

ds1.joinWith(broadcast(ds2), ds1("key1") === ds2("key3"), "left_outer")
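For completeness, the rename was along these lines (ds2Raw and the case class MyRecord are placeholders for my actual dataset and its type):

import spark.implicits._

// withColumnRenamed returns an untyped DataFrame, so .as[...] restores the
// typed Dataset after the rename.
val ds2 = ds2Raw.withColumnRenamed("key2", "key3").as[MyRecord]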

Can anyone please explain the reason?

Upvotes: 0

Sachin Janani

Reputation: 1319

The autobroadcast threshold is limited to 2 GB (https://issues.apache.org/jira/browse/SPARK-6235), so if the table is larger than that, it cannot be broadcast. A workaround is to give Spark SQL a hint using the broadcast function, as follows:

largeTableDf.join(broadcast(smallTableDf), "key")
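The same hint works with the typed Dataset API via joinWith; a sketch with placeholder names:

import org.apache.spark.sql.functions.broadcast

// Broadcast hint on the smaller side of a typed join; largeDs, smallDs and
// "key" are placeholders for your own datasets and join column.
largeDs.joinWith(broadcast(smallDs), largeDs("key") === smallDs("key"), "left_outer")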

Upvotes: 3
