Reputation: 143
I have two DataFrames with the below schemas:
clusterDF schema
root
|-- cluster_id: string (nullable = true)
df schema
root
|-- cluster_id: string (nullable = true)
|-- name: string (nullable = true)
I am trying to join them using
val nameDF = clusterDF.join(df, col("clusterDF.cluster_id") === col("df.cluster_id"), "left" )
But above code fails with:
org.apache.spark.sql.AnalysisException: cannot resolve '`clusterDF.cluster_id`' given input columns: [cluster_id, cluster_id, name];;
'Join LeftOuter, ('clusterDF.cluster_id = 'df.cluster_id)
:- Aggregate [cluster_id#0], [cluster_id#0]
: +- Project [cluster_id#0]
: +- Filter (name#18 = kroger)
: +- Project [cluster_id#0, name#18]
: +- Generate explode(influencers#1.screenName), true, false, [name#18]
: +- Relation[cluster_id#0,influencers#1] json
+- Project [cluster_id#26, name#18]
+- Generate explode(influencers#27.screenName), true, false, [name#18]
+- Relation[cluster_id#26,influencers#27] json
This seems very weird to me. Any suggestions, please?
Upvotes: 0
Views: 6071
Reputation: 41987
The error message is clear enough:
org.apache.spark.sql.AnalysisException: cannot resolve '
clusterDF.cluster_id
' given input columns: [cluster_id, cluster_id, name];;
It says that the column name you are using cannot be resolved: `col("clusterDF.cluster_id")` refers to a table alias named `clusterDF`, but no such alias exists — `clusterDF` is only the name of your Scala variable, not something Spark knows about. Use one of the following methods instead:
val nameDF = clusterDF.join(df, clusterDF("cluster_id") === df("cluster_id"), "left")
or
import org.apache.spark.sql.functions._
val nameDF = clusterDF.as("table1").join(df.as("table2"), col("table1.cluster_id") === col("table2.cluster_id"), "left")
or
import spark.implicits._
val nameDF = clusterDF.as("table1").join(df.as("table2"), $"table1.cluster_id" === $"table2.cluster_id", "left")
or, since the join column has the same name on both sides, pass the column name directly (the `usingColumns` variant of `join`), which also avoids a duplicate `cluster_id` column in the result:
val nameDF = clusterDF.join(df, Seq("cluster_id"), "left")
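For completeness, here is a minimal self-contained sketch of the alias-based approach. The sample rows and the local `SparkSession` are made up for illustration; only the two-column schemas come from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object JoinDemo extends App {
  // Local session just for the demo
  val spark = SparkSession.builder().master("local[1]").appName("join-demo").getOrCreate()
  import spark.implicits._

  // Hypothetical sample data matching the schemas in the question
  val clusterDF = Seq("c1", "c2").toDF("cluster_id")
  val df = Seq(("c1", "kroger"), ("c3", "acme")).toDF("cluster_id", "name")

  // Alias each side, then qualify the join columns with those aliases
  val nameDF = clusterDF.as("table1")
    .join(df.as("table2"), col("table1.cluster_id") === col("table2.cluster_id"), "left")

  nameDF.show()
  // c1 matches a row in df; c2 has no match, so the left join keeps it with a null name
  println(nameDF.count())

  spark.stop()
}
```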
Upvotes: 4