Reputation: 1396
Good day,
I'm joining data in PySpark. Coming from SQL, I like to define the join by a common key, like this:
data_want = data_1.join(data_2, data_1.common_key == data_2.common_key, 'left')
data_want.columns
[<normal_columns>, common_key, common_key]
I get duplicate entries of the common_key column, which seems very odd. When I do this with the shorter syntax:
data_want = data_1.join(data_2, 'common_key', 'left')
data_want.columns
[<normal_columns>, common_key]
everything seems to be fine.
Can anyone explain what's going on here? Also, how would one go about writing the longer version, which I find more familiar? I can't seem to refer to the second column with the same name.
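To make the problem concrete, here is a minimal reproduction with made-up toy data (the column names and values are just illustrative); selecting the duplicated key by name fails with an ambiguity error:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy frames sharing only the join key
data_1 = spark.createDataFrame([(1, 'a')], ['common_key', 'val_1'])
data_2 = spark.createDataFrame([(1, 'b')], ['common_key', 'val_2'])

data_want = data_1.join(data_2, data_1.common_key == data_2.common_key, 'left')
data_want.select('common_key')  # AnalysisException: Reference 'common_key' is ambiguous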
Running on Databricks with Spark 3.2.1 and Scala 2.12.
Upvotes: 0
Views: 52
Reputation: 26676
In Spark, every column carries a unique ID used internally by the Catalyst/SQL engine. That ID is internal metadata and very hard to refer to directly. A join expression does not eliminate the duplicated common_key columns; because they share the same name, both show up in the result. You can't drop one of them by name alone, and the internal ID is hidden. A common approach is therefore to alias the DataFrame to create a unique qualified name, then use that qualified name to drop the column. Code below.
from pyspark.sql.functions import col

# drop via an aliased Column reference; a plain string 'data_1.common_key' would not resolve
data_want = data_1.alias('data_1').join(data_2, data_1.common_key == data_2.common_key, 'left').drop(col('data_1.common_key'))

Another option is to rename the common column in one of the DataFrames before joining, then drop the renamed duplicate afterwards.
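A minimal sketch of that rename-then-join route (the suffix in common_key_2 is just an illustrative choice):
data_2_renamed = data_2.withColumnRenamed('common_key', 'common_key_2')

# the renamed key is now unique, so it can be dropped by a plain string name
data_want = data_1.join(data_2_renamed, data_1.common_key == data_2_renamed.common_key_2, 'left').drop('common_key_2')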
Upvotes: 1