Reputation: 439
Reputation: 439
Joining two DataFrames is throwing a rather weird error in Scala Spark. I am trying to join two datasets, input and metric.
Snippet I am executing (note: broadcast should wrap only the metric DataFrame, not the join arguments):
input.as("input").join(broadcast(metric.as("metric")), Seq("seller", "seller_seller_tag"), "left_outer")
The join fails with:
org.apache.spark.sql.AnalysisException: Resolved attribute(s) seller_seller_tag#2763 missing from seller#2483, seller_tag#2484, stats_date#2507, learner_id#2508, keyword_click#2493L, visible_impression#2494L, seller_tag_misc#439 in operator +- !Aggregate [seller#2483, seller_seller_tag#2763], [seller#2483, seller_seller_tag#2763, sum(visible_impression#2494L) AS visible_impression#527L, sum(keyword_click#2493L) AS keyword_click#529L]. Attribute(s) with the same name appear in the operation: seller_seller_tag. Please check if the right attribute(s) are used.;;
I have verified that both DataFrames have the seller_seller_tag column.
The weird part:
This is the logical plan for metric https://dpaste.com/7NECV5DWX
This is the logical plan for input https://dpaste.com/DYSGLUWCY
And this is the logical plan for the join https://dpaste.com/AGH2BCN2G
In the plan for the join, the attribute id of seller_seller_tag in the metric DataFrame suddenly changes:
+- !Aggregate [seller#2483, seller_seller_tag#2763], [seller#2483, seller_seller_tag#2763, sum(visible_impression#2494L) AS visible_impression#527L, sum(keyword_click#2493L) AS keyword_click#529L]
+- Project [seller#2483, seller_tag#2484, stats_date#2507, learner_id#2508, keyword_click#2493L, visible_impression#2494L, seller_tag_misc#439, concat(seller#2483, __, seller_tag_misc#439) AS seller_seller_tag#482]
while the same step in the standalone plan for metric is:
+- Aggregate [seller#2483, seller_seller_tag#482], [seller#2483, seller_seller_tag#482, sum(visible_impression#2494L) AS visible_impression#527L, sum(keyword_click#2493L) AS keyword_click#529L]
+- Project [seller#2483, seller_tag#2484, stats_date#2507, learner_id#2508, keyword_click#2493L, visible_impression#2494L, seller_tag_misc#439, concat(seller#2483, __, seller_tag_misc#439) AS seller_seller_tag#482]
The plan changes attribute ids on join! Why?
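For reference, logical plans like the ones linked above can be printed directly; this is a sketch of how they are typically captured (the question does not say exactly how the pastes were produced):

```scala
// Print the analyzed/optimized logical plans and the physical plan.
metric.explain(extended = true)
input.explain(extended = true)

// Or inspect just the resolved logical plan, with attribute ids like seller#2483:
println(metric.queryExecution.analyzed)
```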
Upvotes: 1
Views: 374
Reputation: 439
This is a bug in Spark 2.4.0 (and probably earlier versions as well, though I am not sure). Even though the SQL appears legitimate, an AnalysisException
is thrown when your query has multiple joins. This happens because of ResolveReferences.dedupRight
, an internal method Spark uses while resolving plans. Thanks to mazaneicha for pointing this out. You can read more about the issue and reproduce it here: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-32280
The issue has been fixed in later Spark versions (2.4.7, 3.0.1, 3.1.0). https://issues.apache.org/jira/browse/SPARK-32280?attachmentSortBy=dateTime
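If upgrading is not an option right away, one workaround commonly suggested for these attribute-id deduplication problems (a sketch, not from the original answer; `spark`, `input`, and `metric` are assumed to exist as in the question) is to rebuild one side of the join from its RDD and schema, which forces Spark to assign fresh attribute ids to every column:

```scala
import org.apache.spark.sql.functions.broadcast

// Recreating the DataFrame from its RDD and schema severs the shared lineage,
// so the analyzer sees fresh attribute ids and dedupRight has nothing to confuse.
val metricFresh = spark.createDataFrame(metric.rdd, metric.schema)

val joined = input.as("input")
  .join(broadcast(metricFresh.as("metric")),
        Seq("seller", "seller_seller_tag"),
        "left_outer")
```

Note this forces an extra pass through the RDD API, so it is a stopgap; the clean fix is moving to 2.4.7+ or 3.0.1+.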
Upvotes: 1