Rustam-Z

Reputation: 113

Overwrite the rows containing NULL in Dataframe by another Dataframe in PySpark

I have two dataframes, df and df1. In df, some columns contain NULL, but df1 has non-null values for those columns. I just need to overwrite the rows in df where the NULLs exist.

Here is df:

+------------+--------------------+-------+---------------+--------------------+----------+------------+
|          Id|                Name|Country|           City|             Address|  Latitude|   Longitude|
+------------+--------------------+-------+---------------+--------------------+----------+------------+
| 42949672960|Americana Resort ...|     US|         Dillon|         135 Main St|      null|        null|
| 42949672965|Comfort Inn Delan...|     US|         Deland|400 E Internation...| 29.054737|  -81.297208|
| 60129542147|Ubaa Old Crawford...|     US|    Des Plaines|     5460 N River Rd|      null|        null|
+------------+--------------------+-------+---------------+--------------------+----------+------------+

Here is df1:

+-------------+--------------------+-------+------------+--------------------+----------+------------+
|           Id|                Name|Country|        City|             Address|  Latitude|   Longitude|
+-------------+--------------------+-------+------------+--------------------+----------+------------+
|  42949672960|Americana Resort ...|     US|      Dillon|         135 Main St|39.6286685|-106.0451009|
|  60129542147|Ubaa Old Crawford...|     US| Des Plaines|     5460 N River Rd|42.0654049| -87.8916252|
+-------------+--------------------+-------+------------+--------------------+----------+------------+

I want this result:

+------------+--------------------+-------+---------------+--------------------+----------+------------+
|          Id|                Name|Country|           City|             Address|  Latitude|   Longitude|
+------------+--------------------+-------+---------------+--------------------+----------+------------+
| 42949672960|Americana Resort ...|     US|         Dillon|         135 Main St|39.6286685|-106.0451009|
| 42949672965|Comfort Inn Delan...|     US|         Deland|400 E Internation...| 29.054737|  -81.297208|
...
...

Upvotes: 2

Views: 305

Answers (1)

pltc

Reputation: 6082

You can join them, then use coalesce to pick the first non-null lat/lon. A left join is the right choice here: an inner join would drop any row of df1 (such as id 42949672965) that has no match in df2.

df1
+-----------+---------+----------+
|         id|      lat|       lon|
+-----------+---------+----------+
|42949672960|     null|      null|
|42949672965|29.054737|-81.297208|
|60129542147|     null|      null|
+-----------+---------+----------+

df2
+-----------+----------+------------+
|         id|       lat|         lon|
+-----------+----------+------------+
|42949672960|39.6286685|-106.0451009|
|60129542147|42.0654049| -87.8916252|
+-----------+----------+------------+
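For reference, here is a minimal sketch that builds these two sample frames, assuming an active SparkSession named spark, so the join snippet below runs as-is:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# df1: the frame with NULL gaps in lat/lon
df1 = spark.createDataFrame(
    [(42949672960, None, None),
     (42949672965, 29.054737, -81.297208),
     (60129542147, None, None)],
    'id long, lat double, lon double')

# df2: the frame holding the fill-in values
df2 = spark.createDataFrame(
    [(42949672960, 39.6286685, -106.0451009),
     (60129542147, 42.0654049, -87.8916252)],
    'id long, lat double, lon double')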

Join them together

from pyspark.sql import functions as F

(df1
    .join(df2, on=['id'], how='left')  # keep every row of df1
    .select(
        F.col('id'),
        # coalesce returns its first non-null argument, so df1's
        # value wins when present and df2's fills in the NULLs
        F.coalesce(df1['lat'], df2['lat']).alias('lat'),
        F.coalesce(df1['lon'], df2['lon']).alias('lon')
    )
    .show()
)
# +-----------+----------+------------+
# |         id|       lat|         lon|
# +-----------+----------+------------+
# |42949672965| 29.054737|  -81.297208|
# |60129542147|42.0654049| -87.8916252|
# |42949672960|39.6286685|-106.0451009|
# +-----------+----------+------------+
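If more columns than lat/lon need backfilling (as in the original frames with Name, City, Address, and so on), the same join-plus-coalesce pattern generalizes. A sketch, assuming both frames share a schema and id is the join key:

from pyspark.sql import functions as F

# every column except the key gets the coalesce treatment,
# with df1's value winning whenever it is non-null
fill_cols = [c for c in df1.columns if c != 'id']

result = (df1
    .join(df2, on=['id'], how='left')
    .select(F.col('id'),
            *[F.coalesce(df1[c], df2[c]).alias(c) for c in fill_cols]))
result.show()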

Upvotes: 2
