pySpark .join() with different column names and can't be hard coded before runtime

Question

I understand this: final = ta.join(tb, on=['ID'], how='left'), where both the left and right DataFrames have an 'ID' column of the same name.

And I understand this: final = ta.join(tb, ta.leftColName == tb.rightColName, how='left'), where the left and right column names are known before runtime, so they can be hard coded.

But what if the left and right column names in the on predicate are different and are computed or derived from configuration variables? For example:

1) leftColName = 'leftKey'

2) rightColName = 'rightKey'

3) final = ta.join(tb, ta.leftColname == tb.rightColname, how='left')

The values of leftColName and rightColName are not known until runtime, so line 3 cannot be hard coded in advance.

The following doesn't work, because I find the runtime can intermittently get confused about whether rightColName refers to ta or to tb:

final = ta.join(tb, f.col(leftColName) == f.col(rightColName), 'left')

Scala appears to have a facility to enable this.

Gerold Busch · Accepted Answer

You are referencing the column as ta.leftColname, but, as in pandas, you can also reference it as ta["leftColname"].

This way you can use a variable instead of a hard-coded column name. For example:

left_key = 'leftColname'
right_key = 'rightColname'
final = ta.join(tb, ta[left_key] == tb[right_key], how='left')
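If the key names come from configuration, the same bracket lookup can be wrapped in a small helper. The following is only a sketch under some assumptions: join_on_keys, key_pairs and demo are hypothetical names that do not appear in the question, and the sample data in demo is made up for illustration. Because each column is looked up on its own DataFrame, it is unambiguous which side a name refers to.

```python
def join_on_keys(left_df, right_df, key_pairs, how="left"):
    """Join two DataFrames on (left_name, right_name) pairs known only at runtime.

    key_pairs is a hypothetical parameter: a list of tuples such as
    [("leftKey", "rightKey")], e.g. read from configuration.
    """
    if not key_pairs:
        raise ValueError("key_pairs must contain at least one (left, right) pair")
    # Bracket indexing binds each column to the DataFrame it is taken from.
    conditions = [left_df[l_name] == right_df[r_name] for l_name, r_name in key_pairs]
    # PySpark's join accepts a list of Column conditions and combines them with AND.
    return left_df.join(right_df, on=conditions, how=how)


def demo():
    # Illustrative only: requires pyspark; the data below is invented.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("join-demo").getOrCreate()
    ta = spark.createDataFrame([(1, "a"), (2, "b")], ["leftKey", "left_val"])
    tb = spark.createDataFrame([(1, "x")], ["rightKey", "right_val"])
    join_on_keys(ta, tb, [("leftKey", "rightKey")], how="left").show()
    spark.stop()
```

Calling demo() in an environment with PySpark installed runs the join on the sample data. Passing more than one pair in key_pairs should give a composite-key join, since the conditions are combined with AND.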
