Pyspark SQL conditional join issues

Question

I am trying to conditionally join these two data sets using the joinConditional function below. I found a similar description for scala code, but for Python I cant get this to work. It keeps throwing a "Column not iterable" error. Im fairly new to Pyspark and could use some help.

Here is the dfviol dataframe

+------+------+
|Number|Letter|
+------+------+
|     5|     a|
|     2|     b|
|     3|     c|
|     6|     a|
|     7|     b|
+------+------+

here is the dfcent dataframe

+---+------+-----------------+------------------+------------------+-------------------+
| ID|Letter|Num_range_low_odd|Num_range_high_odd|Num_range_low_even|Num_range_high_even|
+---+------+-----------------+------------------+------------------+-------------------+
|  1|     a|                1|                 5|                 6|                 10|
|  2|     a|                7|                 9|                 0|                  4|
|  3|     a|               11|                15|                10|                 14|
|  4|     b|                1|                 5|                 0|                  4|
|  5|     b|                7|                 9|                 6|                 10|
|  6|     c|                1|                 5|                 0|                  4|
|  7|     c|                7|                 9|                 6|                 10|
|  8|     c|               11|                15|                10|                 14|
+---+------+-----------------+------------------+------------------+-------------------+

The idea is that for merging the two data frames I am going to use the number in dfviol, and if it is even use the even side of the dfcent to merge the frames, if it isnt, ill use the odd.

this is what I have got so far, but it says the Column is not iterable.

joinCondition = when(dfviol.Number%2== 0, [dfcent.Num_range_low_even <= dfviol.Number,dfcent.Num_range_high_even >= dfviol.Number]).otherwise([dfcent.Num_range_low_odd <= dfviol.Number,dfcent.Num_range_high_odd >= dfviol.Number])

df_full = dfviol.join(dfcent,[dfviol.Letter == dfcent.Letter, joinCondition], how='inner')
df_full.show()

I am not sure what I am doing wrong, any help would be fantastic

murtihash · Accepted Answer

You could chain multiple conditions with &.

from pyspark.sql.functions import when

joinCondition = when(dfviol.Number%2== 0, (dfcent.Num_range_low_even <= dfviol.Number)&(dfcent.Num_range_high_even >= dfviol.Number)).otherwise((dfcent.Num_range_low_odd <= dfviol.Number) & (dfcent.Num_range_high_odd >= dfviol.Number))

df_full = dfviol.join(dfcent,[dfviol.Letter == dfcent.Letter, joinCondition], how='inner')
df_full.show()

#+------+------+---+------+-----------------+------------------+------------------+-------------------+
#|Number|Letter| ID|Letter|Num_range_low_odd|Num_range_high_odd|Num_range_low_even|Num_range_high_even|
#+------+------+---+------+-----------------+------------------+------------------+-------------------+
#|     3|     c|  6|     c|                1|                 5|                 0|                  4|
#|     2|     b|  4|     b|                1|                 5|                 0|                  4|
#|     7|     b|  5|     b|                7|                 9|                 6|                 10|
#|     5|     a|  1|     a|                1|                 5|                 6|                 10|
#|     6|     a|  1|     a|                1|                 5|                 6|                 10|
#+------+------+---+------+-----------------+------------------+------------------+-------------------+

Pyspark SQL conditional join issues

Answers (1)

Related Questions