pinegulf

Reputation: 1396

Pyspark join multiplies common column

Good day,

I'm joining data in PySpark. Coming from SQL, I like to define the join by the common key, like this:

data_want = data_1.join(data_2, data_1.common_key == data_2.common_key, 'left')
data_want.columns 

[<normal_columns>, common_key, common_key]

I get duplicate entries of the common_key column. Very odd. When doing this with the shorter syntax:

data_want = data_1.join(data_2, 'common_key', 'left')
data_want.columns 

[<normal_columns>, common_key]

All seems to be ok.

Can anyone explain what's going on here? Moreover, how would one write the longer version, which I find more familiar? I can't seem to refer to the second column with the same name.
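For example, selecting it by name afterwards just fails with an ambiguity error:

data_want.select('common_key')
# AnalysisException: Reference 'common_key' is ambiguous, could be: common_key, common_key.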

Running on Databricks with Spark 3.2.1 and Scala 2.12.

Upvotes: 0

Views: 52

Answers (1)

wwnde

Reputation: 26676

In Spark, each column has a unique ID used by the Catalyst/SQL engine. The ID is internal metadata and very hard to refer to. A join expression such as data_1.common_key == data_2.common_key does not eliminate either common_key column, and because the two columns share a name, they double up in the result. You can't drop one of them by name, since the name is ambiguous, and the internal ID is hidden. A common approach is therefore to alias the DataFrames to create unique qualified names and use those names to drop the duplicate. Code below.

from pyspark.sql import functions as F

data_want = (data_1.alias('d1')
             .join(data_2.alias('d2'), F.col('d1.common_key') == F.col('d2.common_key'), 'left')
             .drop(F.col('d2.common_key')))

Note that with a left join you want to drop the right-hand copy (d2.common_key): it is NULL for rows of data_1 that have no match, while d1.common_key always holds the key.

Another option is to rename the common column in one of the DataFrames and then join.
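A minimal sketch of the rename approach, assuming the temporary name common_key_2 is free:

data_2_r = data_2.withColumnRenamed('common_key', 'common_key_2')
data_want = (data_1
             .join(data_2_r, data_1.common_key == data_2_r.common_key_2, 'left')
             .drop('common_key_2'))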

Upvotes: 1
