BillyBoy

Reputation: 47

Avoiding duplicate column names when joining two data frames in PySpark

I have the following code:

from pyspark.sql import SQLContext
ctx = SQLContext(sc)
a = ctx.createDataFrame([("1","a",1),("2","a",1),("3","a",0),("4","a",0),("5","b",1),("6","b",0),("7","b",1)],["id","group","value1"])
b = ctx.createDataFrame([("1","a",8),("2","a",1),("3","a",1),("4","a",2),("5","b",1),("6","b",3),("7","b",4)],["id","group","value2"])
c = a.join(b,"id")
c.select("group")

It returns an error:

pyspark.sql.utils.AnalysisException: Reference 'group' is ambiguous, could be: group#1406, group#1409.;

The problem is that c has twice the same column "group":

>>> c.columns
['id', 'group', 'value1', 'group', 'value2']

I would like to be able to do, for example, c.select("a.group"), but I don't know how to have the column names adjusted automatically when doing the join.

Upvotes: 0

Views: 1705

Answers (1)

Mariusz

Reputation: 13946

Just remove the quotes: c.select(a.group) will select the group column from dataframe a, since the attribute reference is unambiguous even after the join.

Upvotes: 2