Reputation: 168
I am trying to iterate over the rows of one PySpark dataframe and use the values in each row to perform operations (filter, select) on a second PySpark dataframe, then bind all of the results together. Perhaps this is best illustrated:
DF1
id  name  which_col
1   John  col1
2   Jane  col3
3   Bob   col2
4   Barb  col1
DF2
name  col1  col2  col3
Bob     78    43    54
Bob     23    65    34
Bob     12    98    75
John    45    54    54
John    75    43    12
Jane    24    45    21
...
The steps I'd like to perform for every row in DF1 are:
1. Filter DF2 to the rows whose name matches the row's name.
2. Select the column named in that row's which_col.
3. Bind (union) all of the per-row results together.
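A rough sketch of the row-by-row approach I have in mind (it collects DF1 to the driver, so it only works if DF1 is small):
results = []
from functools import reduce
from pyspark.sql import functions as F

pieces = []
for row in df1.collect():                              # pull DF1 rows to the driver
    pieces.append(
        df2.filter(F.col('name') == row['name'])       # step 1: filter by name
           .select(
               F.lit(row['id']).alias('id'),           # carry the DF1 id along
               'name',
               F.col(row['which_col']).alias('value')  # step 2: pick the named column
           )
    )
result = reduce(lambda a, b: a.unionByName(b), pieces) # step 3: bind all results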
Upvotes: 0
Views: 334
Reputation: 42352
You can unpivot (stack) df2 before joining:
result = df1.join(
    df2.selectExpr(
        'name',
        # builds "stack(3, 'col1', col1, 'col2', col2, 'col3', col3) as (which_col, value)"
        'stack(3, ' + ', '.join(["'%s', %s" % (c, c) for c in df2.columns[1:]]) + ') as (which_col, value)'
    ),
    ['name', 'which_col'],  # join on both the name and the unpivoted column label
    'left'                  # left join keeps DF1 rows with no match (e.g. Barb)
)
result.show()
+----+---------+---+-----+
|name|which_col| id|value|
+----+---------+---+-----+
|John| col1| 1| 75|
|John| col1| 1| 45|
|Jane| col3| 2| 21|
| Bob| col2| 3| 98|
| Bob| col2| 3| 65|
| Bob| col2| 3| 43|
|Barb| col1| 4| null|
+----+---------+---+-----+
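On Spark 3.4+, the same unpivot can be written with DataFrame.unpivot instead of a stack expression; a sketch assuming the same df1/df2 as above:
# unpivot the value columns into (which_col, value) pairs, then join as before
unpivoted = df2.unpivot('name', df2.columns[1:], 'which_col', 'value')
result = df1.join(unpivoted, ['name', 'which_col'], 'left')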
Upvotes: 3