Reputation: 1939
I have a dataframe to which I am applying a lambda function in order to copy over a row value based on the value of another column.
In Pandas it looks like this:
import pandas as pd
from numpy import nan

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': ['one', 'two', 'three', 'five']})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': ['five', 'six', nan, nan]})
new_df = df1.merge(df2, how='left', left_on='lkey', right_on='rkey')
  lkey value_x rkey value_y
0  foo     one  foo    five
1  foo     one  foo     NaN
2  bar     two  bar     six
3  baz   three  baz     NaN
4  foo    five  foo    five
5  foo    five  foo     NaN
def my_func(row):
    # copy value_y into value_x when value_y is not missing
    if not row['value_y'] in [nan]:
        row['value_x'] = row['value_y']
    return row
applied_df = new_df.apply(lambda x: my_func(x), axis=1)
  lkey value_x rkey value_y
0  foo    five  foo    five
1  foo     one  foo     NaN
2  bar     six  bar     six
3  baz   three  baz     NaN
4  foo    five  foo    five
5  foo    five  foo     NaN
How would I do something similar in PySpark?
Upvotes: 1
Views: 417
Reputation: 8410
Try this:
from pyspark.sql import functions as F

df1.withColumnRenamed("value", "value_x")\
   .join(df2.withColumnRenamed("value", "value_y"),
         F.col("lkey") == F.col("rkey"), 'left')\
   .withColumn("value_x", F.when(F.col("value_y").isNotNull(), F.col("value_y"))
                           .otherwise(F.col("value_x")))\
   .show()
#+----+-------+----+-------+
#|lkey|value_x|rkey|value_y|
#+----+-------+----+-------+
#| bar| six| bar| six|
#| foo| five| foo| five|
#| foo| one| foo| null|
#| foo| five| foo| five|
#| foo| five| foo| null|
#| baz| three| baz| null|
#+----+-------+----+-------+
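If you want to reproduce this locally, a minimal sketch of the setup might look like the following. It assumes an active SparkSession bound to the name spark (not shown above) and uses None where the pandas example uses nan, since Spark represents missing values as nulls. It also shows an equivalent way to write the fill with F.coalesce, which returns the first non-null column.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumed setup; adjust to your own Spark environment.
spark = SparkSession.builder.getOrCreate()

# Recreate the example data; None becomes a SQL null in Spark.
df1 = spark.createDataFrame(
    [("foo", "one"), ("bar", "two"), ("baz", "three"), ("foo", "five")],
    ["lkey", "value"])
df2 = spark.createDataFrame(
    [("foo", "five"), ("bar", "six"), ("baz", None), ("foo", None)],
    ["rkey", "value"])

# coalesce picks value_y when it is non-null, otherwise value_x,
# which is the same logic as the when/otherwise above.
df1.withColumnRenamed("value", "value_x")\
   .join(df2.withColumnRenamed("value", "value_y"),
         F.col("lkey") == F.col("rkey"), 'left')\
   .withColumn("value_x", F.coalesce("value_y", "value_x"))\
   .show()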
Upvotes: 4