Reputation: 23
I am using Spark 2.0.2 in local mode. I have a join between two Datasets.
It is quite fast when I use Spark SQL or the DataFrame API (untyped Dataset[Row]), but when I use the typed Dataset API, I get the error below.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Both sides of this join are outside the broadcasting threshold and computing it could be prohibitively expensive. To explicitly enable it, please set spark.sql.crossJoin.enabled = true;
I increase "spark.sql.conf.autoBroadcastJoinThreshold", but it is still the same error. And then I set "spark.sql.crossJoin.enabled" to "true", it works but takes very long time to complete.
I didn't do any repartition. The source are two parquets.
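In case it helps, here is a minimal sketch of how I set those two options (the local master and the threshold value are just placeholders for what I used):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024) // threshold in bytes
  .config("spark.sql.crossJoin.enabled", "true") // lets the join run, but it is very slow
  .getOrCreate()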
Any idea?
Upvotes: 2
Views: 1505
Reputation: 23
I found the reason. My ds1 also contains a field "key2", which has the same name as the join key of ds2. After renaming ds2("key2") to ds2("key3"), the join below is fast now.
ds1.joinWith(broadcast(ds2), ds1("key1") === ds2("key3"), "left_outer")
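For reference, here is a sketch of the rename step (withColumnRenamed returns an untyped Dataset[Row], which joinWith still accepts):

import org.apache.spark.sql.functions.broadcast

val ds2Renamed = ds2.withColumnRenamed("key2", "key3") // avoid the clash with ds1's "key2"
ds1.joinWith(broadcast(ds2Renamed), ds1("key1") === ds2Renamed("key3"), "left_outer")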
Can anyone please explain the reason?
Upvotes: 0
Reputation: 1319
The auto-broadcast threshold is limited to 2 GB (https://issues.apache.org/jira/browse/SPARK-6235), so if the table size exceeds that value, Spark will not broadcast it. A workaround is to give Spark SQL a hint with the broadcast function, as follows:
largeTableDf.join(broadcast(smallTableDf), "key")
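Note that broadcast comes from org.apache.spark.sql.functions. The hint overrides Spark's own size estimate, so the planner will broadcast the smaller side even when it sits above the configured spark.sql.autoBroadcastJoinThreshold.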
Upvotes: 3