Reputation: 41858
I have three tables; two of them have keys into the third, so I joined A and B to produce D.
Now I want to finish by joining D with C.
The problem is that I get this error:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "ClaimKey" among (_1, _2);
at org.apache.spark.sql.Dataset$$anonfun$resolve$1.apply(Dataset.scala:219)
This is the actual code, from Zeppelin:
joinedperson.printSchema
filteredtable.printSchema
val joined = joinedperson.joinWith(filteredtable,
  filteredtable.col("ClaimKey") === joinedperson.col("ClaimKey"))
These are the schemas of the two tables I am trying to join, and the problem is with ClaimKey in the first schema.
root
|-- _1: struct (nullable = false)
| |-- clientID: string (nullable = true)
| |-- PersonKey: string (nullable = true)
| |-- ClaimKey: string (nullable = true)
|-- _2: struct (nullable = false)
| |-- ClientID: string (nullable = true)
| |-- MyPersonKey: string (nullable = true)
root
|-- clientID: string (nullable = true)
|-- ClaimType: string (nullable = true)
|-- ClaimKey: string (nullable = true)
I read the original data in from Parquet files, then used case classes to map the rows into typed Datasets; a rough sketch of the setup is below.
I expect the problem is due to the tuples that joinWith produces, so how can I do this join?
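For context, the Datasets were built roughly like this; the case class names, Parquet paths, and the A-B join key here are placeholders, not my actual code:

import spark.implicits._

case class Person(clientID: String, PersonKey: String, ClaimKey: String)
case class Membership(ClientID: String, MyPersonKey: String)

// Placeholder paths; the real data comes from our Parquet files
val persons = spark.read.parquet("/path/to/persons").as[Person]
val memberships = spark.read.parquet("/path/to/memberships").as[Membership]

// joinWith returns a Dataset of tuples, so the result's schema
// has the two sides nested under _1 and _2
val joinedperson = persons.joinWith(memberships,
  persons.col("PersonKey") === memberships.col("MyPersonKey"))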
Upvotes: 1
Views: 5072
Reputation: 37832
The structure of your first Dataset is nested: ClaimKey is a field inside another field (_1). To access such a field, give the full "route" to it, with the parent fields separated by dots:
val joined = joinedperson.joinWith(filteredtable,
  filteredtable.col("ClaimKey") === joinedperson.col("_1.ClaimKey"))
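If you'd rather avoid dot paths entirely, another option is to flatten the tuple back into a plain case class before the second join. This is just a sketch, assuming the tuple elements are case classes with the fields shown in your schemas; the Flat name (and ClientID2, renamed to dodge Spark's case-insensitive column resolution) is made up:

case class Flat(clientID: String, PersonKey: String, ClaimKey: String,
    ClientID2: String, MyPersonKey: String)

// Map each tuple into one flat record, so ClaimKey becomes
// a top-level column again
val flat = joinedperson.map { case (p, m) =>
  Flat(p.clientID, p.PersonKey, p.ClaimKey, m.ClientID, m.MyPersonKey)
}

val joined = flat.joinWith(filteredtable,
  filteredtable.col("ClaimKey") === flat.col("ClaimKey"))

The upside is that the result of the second join stays a fully typed Dataset with named fields instead of nested tuples.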
Upvotes: 4