pinegulf

Reputation: 1396

Pyspark join multiplies common column

Good day,

I'm joining data in PySpark. Coming from SQL, I like to define the join by the common key, like this:

data_want = data_1.join(data_2, data_1.common_key == data_2.common_key, 'left')
data_want.columns 

[<normal_columns>, common_key, common_key]

I get duplicate entries of the common_key column. Very odd. When doing this with the shorter syntax:

data_want = data_1.join(data_2, 'common_key', 'left')
data_want.columns 

[<normal_columns>, common_key]

All seems to be ok.

Can anyone explain what's going on here? Moreover, how would one write the longer version, which I find more familiar? I can't seem to refer to the second column with the same name.
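For example, selecting it by name afterwards just fails with an ambiguity error:

data_want.select('common_key')
# AnalysisException: Reference 'common_key' is ambiguous, could be: common_key, common_key.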

Running on Databricks with Spark 3.2.1 and Scala 2.12.

Upvotes: 0

Views: 52

Answers (1)

wwnde

Reputation: 26676

In Spark, each column has a unique ID used by the Catalyst/SQL engine. The ID is internal metadata and very hard to refer to. A join expression such as data_1.common_key == data_2.common_key does not eliminate either common_key column, and because the two columns share a name, they double up in the result. You can't drop one of them by name, since the name is ambiguous, and the internal ID is hidden. A common approach is therefore to alias the DataFrames to create unique qualified names and use those names to drop the duplicate. Code below.

from pyspark.sql import functions as F

data_want = (data_1.alias('d1')
             .join(data_2.alias('d2'), F.col('d1.common_key') == F.col('d2.common_key'), 'left')
             .drop(F.col('d2.common_key')))

Note that with a left join you want to drop the right-hand copy (d2.common_key): it is NULL for rows of data_1 that have no match, while d1.common_key always holds the key.

Another option is to rename the common column in one of the DataFrames and then join.
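A minimal sketch of the rename approach, assuming the temporary name common_key_2 is free:

data_2_r = data_2.withColumnRenamed('common_key', 'common_key_2')
data_want = (data_1
             .join(data_2_r, data_1.common_key == data_2_r.common_key_2, 'left')
             .drop('common_key_2'))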

Upvotes: 1
