Reputation:
I'm trying to make a simple agg function in my output df. As you can see below, I try to sum a column.
I think the problem comes from the fact that both joins have inside the 'trx_detail_total_amount' column.
Here is the code used:
output = (df2
.groupby()
.agg(f.sum('trx_detail_total_amount').alias('total_gross_turnover'))
)
The error that I get from Databricks:
AnalysisException: Reference 'trx_detail_total_amount' is ambiguous, could be: ...
I've tried the following solution:
output = (df2
.groupby()
.agg(f.sum('trx_detail_total_amount' as 'solution').alias('total_gross_turnover'))
)
The error that I get from Databricks:
SyntaxError: invalid syntax
Any clue?
Thanks!
Upvotes: 0
Views: 2718
Reputation: 8781
Add an alias in your join step, before the aggregation. Example
df.columns
['id', 'subject', 'mark']
df1.columns
['id', 'mark']
Error replication:
dfj=df.join(df1,df["id"]==df["id"])
df2=dfj.groupBy().agg(sum('mark').alias('total_marks')).show(10,truncate=False)
org.apache.spark.sql.AnalysisException: Reference 'mark' is ambiguous, could be: mark, mark.
After adding alias:
dfj=df.alias("t1").join(df1.alias("t2"),df["id"]==df1["id"])
df2=dfj.groupBy().agg(sum('t1.mark').alias('total_marks')).show(10,truncate=False)
+-----------+
|total_marks|
+-----------+
|91055.0 |
+-----------+
Note that you can also do
dfj=df.alias("t1").join(df1.alias("t2"),col("t1.id")==col("t2.id"))
Upvotes: 2
Reputation: 42422
I don't know how you joined the dataframe, but you can rename the column before joining the dataframe, something similar as below:
df1 = df1.withColumnRenamed('trx_detail_total_amount', 'trx_detail_total_amount_1')
df2 = df0.join(df1, ...)
output = (df2
.groupby()
.agg(f.sum('trx_detail_total_amount').alias('total_gross_turnover'))
)
In this way you can avoid having duplicated column names in the joined dataframe df2
.
Upvotes: 2