HBoulmi

Reputation: 333

Join two DataFrames using Spark Scala

I have this code:

    val o = p_value.alias("d1").join(t_d.alias("d2"),
      col("d1.origin_latitude") === col("d2.origin_latitude") &&
      col("d1.origin_longitude") === col("d2.origin_longitude"), "left")
      .filter(col("d2.origin_longitude").isNull)
    val c = p_value2.alias("d3").join(o.alias("d4"),
      col("d3.origin_latitude") === col("d4.origin_latitude") &&
      col("d3.origin_longitude") === col("d4.origin_longitude"), "left")
      .filter(col("d3.origin_longitude").isNull)

I get this error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 'd4.origin_latitude' is ambiguous, could be: d4.origin_latitude, d4.origin_latitude.;
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:101)

on this line:

    col("d3.origin_latitude") === col("d4.origin_latitude") && col("d3.origin_longitude") === col("d4.origin_longitude"), "left")

Any idea?

Thank you.

Upvotes: 0

Views: 337

Answers (1)

Belwal

Reputation: 473

You are aliasing the DataFrame, not the columns; the alias is only a name used to refer to that DataFrame's columns. So the first join produces another DataFrame that contains the same column names twice (origin_latitude as well as origin_longitude). As soon as you try to access one of these columns in the resulting DataFrame, you get the ambiguity error.
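You can see the duplication directly in the schema of the joined DataFrame (an illustrative sketch, assuming p_value and t_d both carry the two coordinate columns):

    import org.apache.spark.sql.functions.col

    // The alias only renames the DataFrame for column resolution; the joined
    // schema still contains both copies of the coordinate columns.
    val joined = p_value.alias("d1").join(t_d.alias("d2"),
      col("d1.origin_latitude") === col("d2.origin_latitude") &&
      col("d1.origin_longitude") === col("d2.origin_longitude"), "left")
    joined.printSchema() // origin_latitude and origin_longitude each appear twice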

So you need to make sure the resulting DataFrame contains each column only once. You can rewrite the first join as below, passing the join keys as a Seq of column names so that Spark keeps a single copy of each:

    p_value
      .join(t_d, Seq("origin_latitude", "origin_longitude"), "left")
      .filter(t_d("origin_longitude").isNull)
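Since a left join followed by an isNull filter on the right side is really an anti-join, you could also use Spark's built-in left_anti join type, which keeps only the left rows with no match and carries none of the right side's columns into the output. A sketch, assuming the second step's filter was meant to test the right side (d4) like the first one:

    // left_anti keeps only the p_value rows with no coordinate match in t_d;
    // the result contains only p_value's columns, so nothing is ambiguous later.
    val o = p_value.join(t_d, Seq("origin_latitude", "origin_longitude"), "left_anti")
    val c = p_value2.join(o, Seq("origin_latitude", "origin_longitude"), "left_anti")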

Upvotes: 1
