Reputation: 73
I have created two dataframes as below:
df_flights = spark1.read.parquet('domestic-flights\\flights.parquet')
df_airport_codes = spark1.read.load('domestic-flights\\flights.csv',format="csv",sep=",",inferSchema=True,header=True)
I then referenced the Databricks guide on avoiding duplicate columns after a join: https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
df3 = df_flights.join(df_airport_codes, "origin_airport_code", "left")
When I try to filter or sort by any of the columns that appeared in both dataframes, I still get the same error:
Py4JJavaError: An error occurred while calling o1553.filter.
: org.apache.spark.sql.AnalysisException: Reference 'passengers' is ambiguous, could be: passengers, passengers.;
Or, if I attempt a sort:
df3.sort('passengers')
Py4JJavaError: An error occurred while calling o1553.sort.: org.apache.spark.sql.AnalysisException: cannot resolve '`passengers`' given input columns: [flights, destination_population, origin_city, distance, passengers, seats, flights, origin_population, passengers, flight_datetime, origin_air_port_code, flight_year, seats, origin_city, destination_city, destination_city, destination_airport_code, destination_airport_code, origin_population, destination_population, flight_month, distance];;
The question is: is there an error in my join logic? If not, how do I alias the ambiguous columns?
Upvotes: 1
Views: 1652
Reputation: 4420
There is no error in your join. Both dataframes have columns with the same names, so your resulting dataframe contains ambiguous column names.
That is why sorting by passengers raises the exception. You need to sort by a qualified column reference instead:
df3.sort(df_flights.passengers)
Or first select the appropriate columns and then sort, like:
df3.select(df_flights.passengers, df_flights.origin_city, ...).sort("passengers").show()
Column names need to be unique before you can run operations such as sort on them in Spark.
Upvotes: 1