Ayush Goyal

Reputation: 439

Error with join in scala spark. org.apache.spark.sql.AnalysisException: Resolved attribute(s) missing

Joining two DataFrames is throwing a rather weird error in Scala Spark. I am trying to join two datasets, input and metric.

Snippet I am executing->

input.as("input").join(broadcast(metric.as("metric")), Seq("seller", "seller_seller_tag"), "left_outer")

The join breaks with ->

org.apache.spark.sql.AnalysisException: Resolved attribute(s) seller_seller_tag#2763 missing from seller#2483, seller_tag#2484, stats_date#2507, learner_id#2508, keyword_click#2493L, visible_impression#2494L, seller_tag_misc#439 in operator +- !Aggregate [seller#2483, seller_seller_tag#2763], [seller#2483, seller_seller_tag#2763, sum(visible_impression#2494L) AS visible_impression#527L, sum(keyword_click#2493L) AS keyword_click#529L]. Attribute(s) with the same name appear in the operation: seller_seller_tag. Please check if the right attribute(s) are used.;;

I have asserted that both DataFrames contain the seller_seller_tag column.

The weirdness->

This is the logical plan for metric https://dpaste.com/7NECV5DWX

This is the logical plan for input https://dpaste.com/DYSGLUWCY

And this is the logical plan for the join https://dpaste.com/AGH2BCN2G

In the plan for the join, the expression ID (the #id suffix) for seller_seller_tag suddenly changes for the metric dataframe ->

+- !Aggregate [seller#2483, seller_seller_tag#2763], [seller#2483, seller_seller_tag#2763, sum(visible_impression#2494L) AS visible_impression#527L, sum(keyword_click#2493L) AS keyword_click#529L]
+- Project [seller#2483, seller_tag#2484, stats_date#2507, learner_id#2508, keyword_click#2493L, visible_impression#2494L, seller_tag_misc#439, concat(seller#2483, __, seller_tag_misc#439) AS seller_seller_tag#482]

while the same step in the plan for metric alone is ->

+- Aggregate [seller#2483, seller_seller_tag#482], [seller#2483, seller_seller_tag#482, sum(visible_impression#2494L) AS visible_impression#527L, sum(keyword_click#2493L) AS keyword_click#529L]
+- Project [seller#2483, seller_tag#2484, stats_date#2507, learner_id#2508, keyword_click#2493L, visible_impression#2494L, seller_tag_misc#439, concat(seller#2483, __, seller_tag_misc#439) AS seller_seller_tag#482]

The plan changes expression IDs on join! Can anyone explain why?

Upvotes: 1

Views: 374

Answers (1)

Ayush Goyal

Reputation: 439

This is a bug in Spark 2.4.0 (and probably earlier versions as well; I'm not sure). Even though the SQL appears legitimate, an AnalysisException is thrown when your query has multiple joins. This happens because of ResolveReferences.dedupRight, an internal method Spark uses while resolving plans. Thanks to mazaneicha for pointing this out. You can read more about the issue, and reproduce it, here: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-32280

The issue has been fixed in later versions of Spark (2.4.7, 3.0.1, 3.1.0): https://issues.apache.org/jira/browse/SPARK-32280?attachmentSortBy=dateTime
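If upgrading is not immediately possible, here is a hedged sketch of one workaround that is sometimes used for attribute-ID clashes of this kind: rebuilding one side of the join from its RDD and schema, which forces Spark to assign fresh expression IDs to that DataFrame's columns. The DataFrame names (`input`, `metric`) mirror the question; the tiny sample data is purely illustrative, and I have not verified this against the exact plan in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .appName("dedup-workaround")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Stand-ins for the question's DataFrames (illustrative data only)
val input  = Seq(("s1", "s1__tagA", 10L)).toDF("seller", "seller_seller_tag", "clicks")
val metric = Seq(("s1", "s1__tagA", 5L)).toDF("seller", "seller_seller_tag", "impressions")

// Re-creating the DataFrame from its RDD and schema breaks the shared
// lineage, so the analyzer no longer sees reused attribute IDs on join
val metricFresh = spark.createDataFrame(metric.rdd, metric.schema)

val joined = input.join(broadcast(metricFresh), Seq("seller", "seller_seller_tag"), "left_outer")
```

This only papers over the dedupRight bug by severing the common lineage; the reliable fix is moving to one of the patched versions above.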

Upvotes: 1
