Reputation: 73
I have created two dataframes as below:
df_flights = spark1.read.parquet('domestic-flights\\flights.parquet')
df_airport_codes = spark1.read.load('domestic-flights\\flights.csv',format="csv",sep=",",inferSchema=True,header=True)
I then referenced the Databricks guide on avoiding duplicate columns after a join: https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
df3 = df_flights.join(df_airport_codes, "origin_airport_code", "left")
When I try to filter or sort by any of the columns that appeared in both dataframes, I still get the same error:
Py4JJavaError: An error occurred while calling o1553.filter.
: org.apache.spark.sql.AnalysisException: Reference 'passengers' is ambiguous, could be: passengers, passengers.;
Or, if I attempt a sort:
df3.sort('passengers')
Py4JJavaError: An error occurred while calling o1553.sort.: org.apache.spark.sql.AnalysisException: cannot resolve '`passengers`' given input columns: [flights, destination_population, origin_city, distance, passengers, seats, flights, origin_population, passengers, flight_datetime, origin_air_port_code, flight_year, seats, origin_city, destination_city, destination_city, destination_airport_code, destination_airport_code, origin_population, destination_population, flight_month, distance];;
The question is: is there an error in my join logic? If not, how do I alias the ambiguous columns?
Upvotes: 1
Views: 1652
Reputation: 4420
There is no error in your join. Both dataframes have columns with the same names, so your resulting dataframe contains ambiguous column names.
That is why sorting by passengers raises the exception. You need to sort by a qualified column reference instead:
df3.sort(df_flights.passengers)
Or first select the appropriate columns and then sort, like:
df3.select(df_flights.passengers, df_flights.origin_city, ...).sort("passengers").show()
Column names need to be unique before you can run operations such as sort on them in Spark.
Upvotes: 1