Reputation: 3208
I'm using PySpark (Python 3.8) on Spark 3.0 on Databricks. When running this DataFrame join:
next_df = (
    days_currencies_matrix.alias("a")
    .join(
        data_to_merge.alias("b"),
        [
            days_currencies_matrix.dt == data_to_merge.RATE_DATE,
            days_currencies_matrix.CURRENCY_CODE == data_to_merge.CURRENCY_CODE,
        ],
        "LEFT",
    )
    .select(
        days_currencies_matrix.CURRENCY_CODE,
        days_currencies_matrix.dt.alias("RATE_DATE"),
        data_to_merge.AVGYTD,
        data_to_merge.ENDMTH,
        data_to_merge.AVGMTH,
        data_to_merge.AVGWEEK,
        data_to_merge.AVGMTD,
    )
)
And I’m getting this error:
Column AVGYTD#67187, AVGWEEK#67190, ENDMTH#67188, AVGMTH#67189, AVGMTD#67191 are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via Dataset.as before joining them, and specify the column using qualified name, e.g. df.as("a").join(df.as("b"), $"a.id" > $"b.id"). You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.
This is telling me that the above columns belong to more than one dataset. Why is that happening? The code tells Spark exactly which source DataFrame each column comes from; besides, days_currencies_matrix has only two columns: dt and CURRENCY_CODE.
The source DataFrames are defined as follows:
currencies_in_merging_data = data_to_merge.select('CURRENCY_CODE').distinct()
days_currencies_matrix = dt_days.crossJoin(currencies_in_merging_data)
Is it because the days_currencies_matrix DataFrame is actually built from data_to_merge? Is this related to lazy evaluation, or is it a bug?
BTW, this version works with no issues:
next_df = (
    days_currencies_matrix.alias("a")
    .join(
        data_to_merge.alias("b"),
        [
            days_currencies_matrix.dt == data_to_merge.RATE_DATE,
            days_currencies_matrix.CURRENCY_CODE == data_to_merge.CURRENCY_CODE,
        ],
        "LEFT",
    )
    .select(
        col("a.dt").alias("RATE_DATE"),
        col("a.CURRENCY_CODE"),
        col("b.AVGYTD"),
        col("b.ENDMTH"),
        col("b.AVGMTH"),
        col("b.AVGWEEK"),
        col("b.AVGMTD"),
    )
)
Upvotes: 3
Views: 13024
Reputation: 3208
I found the problem.
The first select() operates on the joined result (next_df), not on the inputs. In the failing version I reference the columns through the original DataFrame objects; because days_currencies_matrix is derived from data_to_merge, those column references share the same lineage, so Spark cannot tell which side of the join they belong to. In the second version I correctly reference the columns through the aliases assigned before the join ("a" and "b"), and those qualified names resolve unambiguously.
BTW, this works too:
next_df = days_currencies_matrix.alias("a").join(
    data_to_merge.alias("b"),
    [
        days_currencies_matrix.dt == data_to_merge.RATE_DATE,
        days_currencies_matrix.CURRENCY_CODE == data_to_merge.CURRENCY_CODE,
    ],
    "LEFT",
)
next_df = next_df.select(next_df.AVGYTD, next_df.AVGWEEK, next_df.ENDMTH)
Upvotes: 1