Reputation: 41858
I have three tables; two of them have keys into the third, so I joined A and B to produce D.
Now I want to finish by joining D with C.
The problem is that I get this error:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "ClaimKey" among (_1, _2);
at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:219)
This is the actual code, from Zeppelin:
joinedperson.printSchema
filteredtable.printSchema
val joined = joinedperson.joinWith(filteredtable,
  filteredtable.col("ClaimKey") === joinedperson.col("ClaimKey"))
These are the schemas of the two tables I am trying to join, and the problem is with ClaimKey in the first schema.
root
|-- _1: struct (nullable = false)
| |-- clientID: string (nullable = true)
| |-- PersonKey: string (nullable = true)
| |-- ClaimKey: string (nullable = true)
|-- _2: struct (nullable = false)
| |-- ClientID: string (nullable = true)
| |-- MyPersonKey: string (nullable = true)
root
|-- clientID: string (nullable = true)
|-- ClaimType: string (nullable = true)
|-- ClaimKey: string (nullable = true)
I read the original data in from Parquet files, then used case classes to map the rows into typed Datasets; a rough sketch of the setup is below.
I expect the problem is due to the tuples that joinWith produces, so how can I do this join?
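For context, the Datasets were built roughly like this; the case class names, Parquet paths, and the A-B join key here are placeholders, not my actual code:

import spark.implicits._

case class Person(clientID: String, PersonKey: String, ClaimKey: String)
case class Membership(ClientID: String, MyPersonKey: String)

// Placeholder paths; the real data comes from our Parquet files
val persons = spark.read.parquet("/path/to/persons").as[Person]
val memberships = spark.read.parquet("/path/to/memberships").as[Membership]

// joinWith returns a Dataset of tuples, so the result's schema
// has the two sides nested under _1 and _2
val joinedperson = persons.joinWith(memberships,
  persons.col("PersonKey") === memberships.col("MyPersonKey"))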
Upvotes: 1
Views: 5072
Reputation: 37832
The structure of your first Dataset is nested: ClaimKey is a field inside another field (_1). To access such a field, give the full "route" to it, with the parent fields separated by dots:
val joined = joinedperson.joinWith(filteredtable,
  filteredtable.col("ClaimKey") === joinedperson.col("_1.ClaimKey"))
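If you'd rather avoid dot paths entirely, another option is to flatten the tuple back into a plain case class before the second join. This is just a sketch, assuming the tuple elements are case classes with the fields shown in your schemas; the Flat name (and ClientID2, renamed to dodge Spark's case-insensitive column resolution) is made up:

case class Flat(clientID: String, PersonKey: String, ClaimKey: String,
    ClientID2: String, MyPersonKey: String)

// Map each tuple into one flat record, so ClaimKey becomes
// a top-level column again
val flat = joinedperson.map { case (p, m) =>
  Flat(p.clientID, p.PersonKey, p.ClaimKey, m.ClientID, m.MyPersonKey)
}

val joined = flat.joinWith(filteredtable,
  filteredtable.col("ClaimKey") === flat.col("ClaimKey"))

The upside is that the result of the second join stays a fully typed Dataset with named fields instead of nested tuples.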
Upvotes: 4