Reputation: 27
I am facing a problem in my Databricks Delta Live Tables (DLT) notebook. I am trying to join two dataframes, one of which is derived from the other, but I keep getting the following error:
"Column col1#8176 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one"
In order to narrow down the issue, I have simplified it as much as possible. The following code is enough to reproduce the error:
from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.sql.functions import lit
schema = StructType([StructField("col1", IntegerType(), True)])
data = [(1, )]
df1 = spark.createDataFrame(data, schema)
df2 = df1.withColumn("col2", lit(2))
df_join = df1.join(df2, df1["col1"] == df2["col2"], "inner")
I have been researching a solution to this error for several hours, and I have found two suggestions so far. The first was to use aliases for the dataframes, like this:
df1_alias = df1.alias("df1_alias")
df2_alias = df2.alias("df2_alias")
df_join = df1_alias.join(df2_alias, df1_alias["col1"] == df2_alias["col2"], "inner")
But unfortunately that still results in the same error.
The other suggestion was to recreate the two dataframes from the RDDs of the originals:
df1 = df1.rdd.toDF()
df2 = df2.rdd.toDF()
df_join = df1.join(df2, df1["col1"] == df2["col2"], "inner")
That does resolve the original error, but when I try to run it as part of my DLT pipeline, I get the following error in the pipeline event log details:
"Method public org.apache.spark.rdd.RDD org.apache.spark.api.java.JavaRDD.rdd() is not whitelisted on class class org.apache.spark.api.java.JavaRDD".
I hope that someone here might have a suggestion.
Upvotes: 0
Views: 43
Reputation: 75
One simple trick I used is to rename all columns with a simple function, so that df1's columns become df1_col1 and so on, and df2's become df2_col1. This is not efficient and will clutter your execution DAG, but it does the job if you have a small dataset. I'd like to see if anyone has an actual resolution.
Upvotes: 0
Reputation: 998
Your final dataframe df_join will have the column col1 twice: once from df1 and once from df2. Here is a join that works:
df_join = df1.select("col1").join(df2.select("col2"), df1["col1"] == df2["col2"], "inner")
Upvotes: 0