RK.

Reputation: 617

org.apache.spark.sql.AnalysisException: cannot resolve

Case 1: When I try to fetch "b.no" I get the error below (code and error message are shared). How can I get values from the second dataframe (aliased as b)? Is selecting values from b even allowed here? If I remove b.no it works fine.

df1.csv

no,name,sal
1,sri,3000
2,ram,2000
3,sam,2500
4,kri,5000
5,tom,4000

df2.csv

no,name,sal
1,sri,3000
1,vas,4000
2,ram,2000
3,sam,2500
4,kri,5000
5,tom,4500
5,toy,4200
5,koy,4999
6,jim,3090
7,kim,2080

code:

from pyspark.shell import spark
from pyspark.sql import SQLContext

sc = spark.sparkContext
sqlContext = SQLContext(sc)

df11 = spark.read.option("header","true").option("delimiter", ",").csv("C:\\inputs\\df1.csv")
df22 = spark.read.option("header","true").option("delimiter", ",").csv("C:\\inputs\\df2.csv")
print("df11", df11.count())
print("df22", df22.count())

resDF = df11.alias("a").join(df22.alias("b"), on='no').select("a.no", "a.name", "b.no")
print("resDF", resDF.count())
print("resDF", resDF.distinct().show())

Error:

py4j.protocol.Py4JJavaError: An error occurred while calling o48.select.
: org.apache.spark.sql.AnalysisException: cannot resolve 'b.no' given input columns: [b.sal, a.no, b.name, a.sal, a.name];;
'Project [no#10, name#11, 'b.no]
+- AnalysisBarrier
   +- Project [no#10, name#11, sal#12, name#27, sal#28]
      +- Join Inner, (no#10 = no#26)
         :- SubqueryAlias a
         :  +- Relation[no#10,name#11,sal#12] csv
         +- SubqueryAlias b
            +- Relation[no#26,name#27,sal#28] csv

Case 2: When I use b.sal I get duplicate values; distinct() does not filter them out.

resDF = df11.alias("a").join(df22.alias("b"), on='no').select("a.no", "a.name", "b.sal")
print("resDF", resDF.distinct().show())

In this case, how do I get distinct values based on 'no' only?

Upvotes: 3

Views: 6396

Answers (1)

cronoik

Reputation: 19365

The problem in case 1 is that when you use a string (or a list of strings) as the join argument, Spark only keeps a.no and does not add b.no, to avoid duplicate columns after the join (see the link for more information). You can avoid this by defining a join expression like F.col('a.no') == F.col('b.no'). See the full example below:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

columns1 = ['no', 'name', 'sal']
columns2 = ['no', 'name', 'sal']

vals1 = [(1, 'sri', 3000), (2, 'ram', 2000), (3, 'sam', 2500), (4, 'kri', 5000), (5, 'tom', 4000)]

vals2 = [(1, 'sri', 3000), (1, 'vas', 4000), (2, 'ram', 2000), (3, 'sam', 2500), (4, 'kri', 5000),
         (5, 'tom', 4500), (5, 'toy', 4200), (5, 'koy', 4999), (6, 'jim', 3090), (7, 'kim', 2080)]

df1 = spark.createDataFrame(vals1, columns1)
df2 = spark.createDataFrame(vals2, columns2)

# here I use an expression instead of a string as the join condition
resDF = df1.alias("a").join(df2.alias("b"), F.col('a.no') == F.col('b.no')).select("a.no", "a.name", "b.no")
resDF.show()

Output:

+---+----+---+
| no|name| no|
+---+----+---+
|  1| sri|  1|
|  1| sri|  1|
|  2| ram|  2|
|  3| sam|  3|
|  4| kri|  4|
|  5| tom|  5|
|  5| tom|  5|
|  5| tom|  5|
+---+----+---+
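
As a side note, if you keep the string form on='no', the join key comes out as a single, unqualified no column, so you can select it without an alias prefix instead of asking for b.no. This is only a minimal sketch, assuming the same df1/df2 as above (resDF2 is just an illustrative name):

# with on='no', the result has exactly one 'no' column, referenced without a prefix
resDF2 = df1.alias("a").join(df2.alias("b"), on='no').select("no", "a.name", "b.sal")
resDF2.show()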

For your case 2: The dataframe distinct method compares complete rows. When you only need the unique values of one column, you have to select that column first:

resDF = df1.alias("a").join(df2.alias("b"), F.col('a.no') == F.col('b.no')).select("a.no", "a.name", "b.sal")
resDF.select('no').distinct().show()
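
If instead you want one row per no while keeping the other columns, a hedged sketch (this goes beyond the snippet above, but dropDuplicates is a standard DataFrame method) would be:

# keep only the first row encountered for each value of 'no'
resDF.dropDuplicates(['no']).show()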

Upvotes: 3
