Kratos

Reputation: 1114

How to join datasets with same columns and select one?

I have two Spark DataFrames which I am joining and then selecting from. I want to select a specific column from one of the DataFrames, but the same column name exists in the other one, so I get an exception for an ambiguous column reference.

I have tried this:

d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left").select($"d1.columnName")

and this:

d1.join(d2, d1("id") === d2("id"), "left").select($"d1.columnName")

but it does not work.

Upvotes: 7

Views: 31022

Answers (3)

Ryan Widmaier

Reputation: 8523

This happens because when Spark combines the columns from the two DataFrames it doesn't do any automatic renaming for you. You just need to rename one of the columns before joining; Spark provides `withColumnRenamed` for this. After the join you can drop the renamed column.

val df2join = df2.withColumnRenamed("id", "join_id")
val joined = df1.join(df2join, $"id" === $"join_id", "left").drop("join_id")
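A minimal end-to-end sketch of this rename-then-drop approach (hedged: `df1`/`df2` here are toy DataFrames built for illustration, not the asker's data, and a local SparkSession is assumed):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Assumes a local SparkSession for demonstration purposes
val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val df1 = spark.range(3).withColumn("columnName", lit("d1"))
// Rename both the join key and the clashing column on the right-hand side
val df2 = spark.range(3).withColumn("columnName", lit("d2"))
  .withColumnRenamed("id", "join_id")
  .withColumnRenamed("columnName", "columnName2")

// Join on the renamed key, then drop the helper column
val joined = df1.join(df2, $"id" === $"join_id", "left").drop("join_id")
joined.select("columnName").show()  // no ambiguity: only df1's columnName remains
```

Renaming the clashing column as well (here to a hypothetical `columnName2`) means the final `select("columnName")` is unambiguous without needing aliases at all.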

Upvotes: 4

Jacek Laskowski

Reputation: 74779

I have two dataframes

val d1 = spark.range(3).withColumn("columnName", lit("d1"))
scala> d1.printSchema
root
 |-- id: long (nullable = false)
 |-- columnName: string (nullable = false)

val d2 = spark.range(3).withColumn("columnName", lit("d2"))
scala> d2.printSchema
root
 |-- id: long (nullable = false)
 |-- columnName: string (nullable = false)

which I am joining and selecting afterwards. I want to select a specific column of one of the Dataframes. But the same column name exists in the other one.

val q1 = d1.as("d1")
  .join(d2.as("d2"), Seq("id"), "left")
  .select("d1.columnName")
scala> q1.show
+----------+
|columnName|
+----------+
|        d1|
|        d1|
|        d1|
+----------+

As you can see it just works.


So, why did it not work for you? Let's analyze each query.

// you started very well
d1.as("d1")
  // but here you used $-interpolated strings together with the aliases
  // to reference the columns to join on
  // that won't work
  .join(d2.as("d2"), $"d1.id" === $"d2.id", "left")
  // same here
  // $ + aliased columns won't work
  .select($"d1.columnName")

PROTIP: Use d1("columnName") to reference a specific column in a dataframe.
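Applying that tip to the question's second query, a sketch (using the `d1`/`d2` DataFrames defined above):

```scala
// Reference the column through its DataFrame rather than a $-string,
// so Spark knows which side's `columnName` is meant
val q2 = d1.join(d2, d1("id") === d2("id"), "left")
  .select(d1("columnName"))
```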

The other query was very close to being fine, but...

d1.join(d2, d1("id") === d2("id"), "left") // <-- so far so good!
  .select($"d1.columnName") // <-- that's the issue, i.e. $ + aliased column

Upvotes: 6

firsni

Reputation: 916

Which Spark version are you using? Can you post a sample of your dataframes? Try this:

val d2prim = d2.withColumnRenamed("columnName", "d2_columnName")
d1.join(d2prim, Seq("id"), "left_outer").select("columnName")

Upvotes: 7
