Steve
Steve

Reputation: 923

pySpark .join() with different column names and can't be hard coded before runtime

I get this final = ta.join(tb, on=['ID'], how='left') both left an right have a 'ID' column of the same name.

And I get this final = ta.join(tb, ta.leftColName == tb.rightColName, how='left') The left & right column names are known before runtime so the column names can be hard coded.

But what if the left and right column names of the on predicate are different and are calculated/ derived by configuration variables? Such as:

1) leftColName = 'leftKey'

2) rightColName = 'rightKey'

3) final = ta.join(tb, ta.leftColname == tb.rightColname, how='left')

The values of leftColName & rightColName are not know before line 3 can be hardcoded and executed.

This doesn't work because I find runtime can intermittently get confused/lost in whether rightColName refers to ta or to tb

final = ta.join(tb, f.col(leftColName) == f.col(rightColName), 'left')

Scala appears to have a facility to enable this.

Upvotes: 2

Views: 12937

Answers (2)

Sung Hoon Yoon
Sung Hoon Yoon

Reputation: 1

If doing on multiple columns, then following list comprehension works

df1 = df10_fj.select(*(f.col(x).alias(x + "_df10") for x in df10_fj.columns))
df2 = df4_fj.select(*(f.col(x).alias(x + "_df4") for x in df4_fj.columns))

# df1, on=[df1.DOB_df10 == df2.DOB_df4, df1.soundex_lnm_df10 == df2.soundex_lnm_df4]

df = df2.join(df1, on=[df1[f"{c}_df10"] == df2[f"{c}_df4"] for c in join_cols])```

Upvotes: 0

Gerold Busch
Gerold Busch

Reputation: 221

You are referencing the column as ta.leftColname, but - similarly to Pandas - you could also reference it by ta["leftColname"].

This way, instead of a hardcoded column name, you can also use a variable. For example:

left_key = 'leftColname'
right_key = 'rightColname'
final = ta.join(tb, ta[left_key] == tb[right_key], how='left')

Upvotes: 11

Related Questions