user14863914
user14863914

Reputation:

Ambigous reference when agg in Pyspark

I'm trying to make a simple agg function in my output df. As you can see below, I try to sum a column.

I think the problem comes from the fact that both joins have inside the 'trx_detail_total_amount' column.

Here is the code used:

output = (df2
         .groupby()
         .agg(f.sum('trx_detail_total_amount').alias('total_gross_turnover'))
        )

The error that I get from Databricks:

AnalysisException: Reference 'trx_detail_total_amount' is ambiguous, could be: ...

I've tried the following solution:

output = (df2
         .groupby()
         .agg(f.sum('trx_detail_total_amount' as 'solution').alias('total_gross_turnover'))
        )

The error that I get from Databricks:

SyntaxError: invalid syntax

Any clue?

Thanks!

Upvotes: 0

Views: 2718

Answers (2)

stack0114106
stack0114106

Reputation: 8781

Add an alias in your join step, before the aggregation. Example

df.columns
['id', 'subject', 'mark']

df1.columns
['id', 'mark']

Error replication:

dfj=df.join(df1,df["id"]==df["id"])
df2=dfj.groupBy().agg(sum('mark').alias('total_marks')).show(10,truncate=False)

org.apache.spark.sql.AnalysisException: Reference 'mark' is ambiguous, could be: mark, mark.

After adding alias:

dfj=df.alias("t1").join(df1.alias("t2"),df["id"]==df1["id"])
df2=dfj.groupBy().agg(sum('t1.mark').alias('total_marks')).show(10,truncate=False)

+-----------+
|total_marks|
+-----------+
|91055.0    |
+-----------+

Note that you can also do

dfj=df.alias("t1").join(df1.alias("t2"),col("t1.id")==col("t2.id"))

Upvotes: 2

mck
mck

Reputation: 42422

I don't know how you joined the dataframe, but you can rename the column before joining the dataframe, something similar as below:

df1 = df1.withColumnRenamed('trx_detail_total_amount', 'trx_detail_total_amount_1')
df2 = df0.join(df1, ...)

output = (df2
         .groupby()
         .agg(f.sum('trx_detail_total_amount').alias('total_gross_turnover'))
         )

In this way you can avoid having duplicated column names in the joined dataframe df2.

Upvotes: 2

Related Questions