Reputation: 3208
I'm using PySpark (Python 3.8) on Spark 3.0 on Databricks. When running this DataFrame join:
next_df = (
    days_currencies_matrix.alias("a")
    .join(
        data_to_merge.alias("b"),
        [
            days_currencies_matrix.dt == data_to_merge.RATE_DATE,
            days_currencies_matrix.CURRENCY_CODE == data_to_merge.CURRENCY_CODE,
        ],
        "LEFT",
    )
    .select(
        days_currencies_matrix.CURRENCY_CODE,
        days_currencies_matrix.dt.alias("RATE_DATE"),
        data_to_merge.AVGYTD,
        data_to_merge.ENDMTH,
        data_to_merge.AVGMTH,
        data_to_merge.AVGWEEK,
        data_to_merge.AVGMTD,
    )
)
And I’m getting this error:
Column AVGYTD#67187, AVGWEEK#67190, ENDMTH#67188, AVGMTH#67189, AVGMTD#67191 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via Dataset.as before joining them, and specify the column using qualified name, e.g. df.as("a").join(df.as("b"), $"a.id" > $"b.id"). You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
This is telling me that the above columns belong to more than one dataset. Why is that happening? The code tells Spark exactly which source DataFrame each column comes from; besides, days_currencies_matrix has only two columns: dt and CURRENCY_CODE.
The source DataFrames are defined as follows:
currencies_in_merging_data = data_to_merge.select('CURRENCY_CODE').distinct()
days_currencies_matrix = dt_days.crossJoin(currencies_in_merging_data)
Is it because the days_currencies_matrix DataFrame is actually built from data_to_merge? Is this related to lazy evaluation, or is it a bug?
BTW, this version works with no issues:
next_df = (
    days_currencies_matrix.alias("a")
    .join(
        data_to_merge.alias("b"),
        [
            days_currencies_matrix.dt == data_to_merge.RATE_DATE,
            days_currencies_matrix.CURRENCY_CODE == data_to_merge.CURRENCY_CODE,
        ],
        "LEFT",
    )
    .select(
        col("a.dt").alias("RATE_DATE"),
        col("a.CURRENCY_CODE"),
        col("b.AVGYTD"),
        col("b.ENDMTH"),
        col("b.AVGMTH"),
        col("b.AVGWEEK"),
        col("b.AVGMTD"),
    )
)
Upvotes: 3
Views: 13024
Reputation: 3208
I found the problem.
The first select() operates on the joined result (next_df), not on the inputs. In the failing version I reference the columns through the original DataFrame objects; because days_currencies_matrix is derived from data_to_merge, those column references share the same lineage, so Spark cannot tell which side of the join they belong to. In the second version I correctly reference the columns through the aliases assigned before the join ("a" and "b"), and those qualified names resolve unambiguously.
BTW, this works too:
next_df = days_currencies_matrix.alias("a").join(
    data_to_merge.alias("b"),
    [
        days_currencies_matrix.dt == data_to_merge.RATE_DATE,
        days_currencies_matrix.CURRENCY_CODE == data_to_merge.CURRENCY_CODE,
    ],
    "LEFT",
)
next_df = next_df.select(next_df.AVGYTD, next_df.AVGWEEK, next_df.ENDMTH)
Upvotes: 1