Ambigous reference when agg in Pyspark

Question

I'm trying to make a simple agg function in my output df. As you can see below, I try to sum a column.

I think the problem comes from the fact that both joins have inside the 'trx_detail_total_amount' column.

Here is the code used:

output = (df2
         .groupby()
         .agg(f.sum('trx_detail_total_amount').alias('total_gross_turnover'))
        )

The error that I get from Databricks:

AnalysisException: Reference 'trx_detail_total_amount' is ambiguous, could be: ...

I've tried the following solution:

output = (df2
         .groupby()
         .agg(f.sum('trx_detail_total_amount' as 'solution').alias('total_gross_turnover'))
        )

The error that I get from Databricks:

SyntaxError: invalid syntax

Any clue?

Thanks!

stack0114106 · Accepted Answer

Add an alias in your join step, before the aggregation. Example

df.columns
['id', 'subject', 'mark']

df1.columns
['id', 'mark']

Error replication:

dfj=df.join(df1,df["id"]==df["id"])
df2=dfj.groupBy().agg(sum('mark').alias('total_marks')).show(10,truncate=False)

org.apache.spark.sql.AnalysisException: Reference 'mark' is ambiguous, could be: mark, mark.

After adding alias:

dfj=df.alias("t1").join(df1.alias("t2"),df["id"]==df1["id"])
df2=dfj.groupBy().agg(sum('t1.mark').alias('total_marks')).show(10,truncate=False)

+-----------+
|total_marks|
+-----------+
|91055.0    |
+-----------+

Note that you can also do

dfj=df.alias("t1").join(df1.alias("t2"),col("t1.id")==col("t2.id"))

Ambigous reference when agg in Pyspark

Answers (2)

Related Questions