Reputation: 813
I am trying to find a way to join two df that have a certain colName
common amongst them, and fill the null
values with new values by using a func()
I have two DFs
DF1:
+---------------+
| colName|
+---------------+
| a|
| b|
| c|
| d|
+---------------+
and DF2:
+-------+-------------+
|colName| uuid|
+-------+-------------+
| a| 1|
| b| 2|
+-------+-------------+
After using a specific pyspark funciton like DF1.join(DF2, where(DF1.colName == DF2.colName))
or df.merge()
I'd like to obtain a df with the following values
DF1 + DF2:
+-------+-------------+
|colName| uuid|
+-------+-------------+
| a| 1|
| b| 2|
| c| null|
| d| null|
+-------+-------------+
and then using a func()
:
def func():
doSomethin(DF1+DF2)
The following result is expected:
+-------+-------------+
|colName| uuid|
+-------+-------------+
| a| 1|
| b| 2|
| c| 3|
| d| 4|
+-------+-------------+
As you can see, DF1
just has colName
, whereas DF2
has uuid
associated. So func()
will actually generate a uuid and fill distinct colName
with distinct uuid
. Any approach towards this would be appreciated. Thanks in advance.
Upvotes: 0
Views: 244
Reputation: 13541
You can do this with the uuid
function simply.
import pyspark.sql.functions as f
df1.join(df2, ['colName'], 'left') \
.withColumn('uuid', f.coalesce(f.col('uuid'), f.expr('uuid()'))) \
.show(10, False)
+-------+------------------------------------+
|colName|uuid |
+-------+------------------------------------+
|a |1 |
|b |2 |
|c |ad037b01-4be8-40f2-99fc-b7589b2c648c|
|d |e59c9315-b626-4dcf-a80a-e16f08f19a96|
+-------+------------------------------------+
Upvotes: 1