Dan Siegel

Reputation: 73

Unable to resolve Column Name Spark

I have created 2 dataframes as below:

df_flights = spark1.read.parquet('domestic-flights\\flights.parquet')
df_airport_codes = spark1.read.load('domestic-flights\\flights.csv',format="csv",sep=",",inferSchema=True,header=True)

I then followed the Databricks guide on avoiding duplicate columns after a join: https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html

df3=df_flights.join(df_airport_codes,"origin_airport_code", 'left')

When I try to filter or sort on any of the columns that were present in both dataframes, I still get an ambiguity error:

Py4JJavaError: An error occurred while calling o1553.filter.

: org.apache.spark.sql.AnalysisException: Reference 'passengers' is ambiguous, could be: passengers, passengers.;

OR if I attempt a sort:

df3.sort('passengers')

Py4JJavaError: An error occurred while calling o1553.sort.: org.apache.spark.sql.AnalysisException: cannot resolve '`passengers`' given input columns: [flights, destination_population, origin_city, distance, passengers, seats, flights, origin_population, passengers, flight_datetime, origin_air_port_code, flight_year, seats, origin_city, destination_city, destination_city, destination_airport_code, destination_airport_code, origin_population, destination_population, flight_month, distance];;

The question is, is there an error with my join logic? If not, how do I alias the ambiguous column?

Upvotes: 1

Views: 1652

Answers (1)

Rakesh Kumar

Reputation: 4420

There is no error in your join. Both dataframes have the same column names, so your resulting dataframe contains ambiguous columns.

This is why sorting by passengers raises an exception. You need to sort by a qualified column reference:

df3.sort(df_flights.passengers)

Or first select the appropriate columns and then sort, like:

df3.select(df_flights.passengers, df_flights.origin_city, ......).sort("passengers").show()

In Spark, column names in a dataframe need to be unique before you can run operations such as sort on them by name.

Upvotes: 1
