kev

Reputation: 2881

PySpark Dataframes: Full Outer Join with a condition

I have the following 2 dataframes-

dataframe_a
+----------------+---------------+
|         user_id|         domain|
+----------------+---------------+
|            josh|     wanadoo.fr|
|        samantha|     randomn.fr|
|             bob|    eidsiva.net|
|           dylan|    vodafone.it|
+----------------+---------------+

dataframe_b
+----------------+---------------+
|         user_id|         domain|
+----------------+---------------+
|            josh|  oldwebsite.fr|
|        samantha|     randomn.fr|
|           dylan|      oldweb.it|
|            ryan|      chicks.it|
+----------------+---------------+

I want to do a full outer join but retain the value from the domain column of dataframe_a in cases where I get 2 different domains for a single user_id. So, my desired dataframe would look like-

desired_df
+----------------+---------------+
|         user_id|         domain|
+----------------+---------------+
|            josh|     wanadoo.fr|
|        samantha|     randomn.fr|
|             bob|    eidsiva.net|
|           dylan|    vodafone.it|
|            ryan|      chicks.it|
+----------------+---------------+
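For reference, a minimal sketch to recreate these example dataframes (assuming an existing SparkSession named spark; the getOrCreate call is only here to make the snippet self-contained):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumption: a local session just for this example

dataframe_a = spark.createDataFrame(
    [("josh", "wanadoo.fr"), ("samantha", "randomn.fr"),
     ("bob", "eidsiva.net"), ("dylan", "vodafone.it")],
    ["user_id", "domain"],
)
dataframe_b = spark.createDataFrame(
    [("josh", "oldwebsite.fr"), ("samantha", "randomn.fr"),
     ("dylan", "oldweb.it"), ("ryan", "chicks.it")],
    ["user_id", "domain"],
)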

I think I can do something like-

desired_df = dataframe_a.join(dataframe_b, ["user_id"], how="full_outer").drop(dataframe_b.domain)

But I'm not sure whether this will give me ryan in my desired dataframe. Is this the right way?

Upvotes: 1

Views: 2897

Answers (2)

Joey Lesh

Reputation: 426

You will want to use 'coalesce'. In your current solution, ryan will be in the resulting dataframe, but with a null value for the remaining dataframe_a.domain column.

joined_df = dataframe_a.join(dataframe_b, ["user_id"], how="full_outer")
+----------------+---------------+---------------+
|         user_id|         domain|         domain|
+----------------+---------------+---------------+
|            josh|     wanadoo.fr|  oldwebsite.fr|
|        samantha|     randomn.fr|     randomn.fr|
|             bob|    eidsiva.net|           null|
|           dylan|    vodafone.it|      oldweb.it|
|            ryan|           null|      chicks.it|
+----------------+---------------+---------------+

'coalesce' allows you to specify a preference order, skipping null values.

import pyspark.sql.functions as F

# Prefer dataframe_a's domain; fall back to dataframe_b's when it is null
joined_df = joined_df.withColumn(
  "preferred_domain",
  F.coalesce(dataframe_a.domain, dataframe_b.domain)
)
# Drop the two original (ambiguously named) domain columns
joined_df = joined_df.drop(dataframe_a.domain).drop(dataframe_b.domain)

Giving

+----------------+----------------+
|         user_id|preferred_domain|
+----------------+----------------+
|            josh|      wanadoo.fr|
|        samantha|      randomn.fr|
|             bob|     eidsiva.net|
|           dylan|     vodafone.it|
|            ryan|       chicks.it|
+----------------+----------------+
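If the final column should again be called domain, as in the desired_df above, here is one possible end-to-end sketch (the withColumnRenamed step is just one way to do the rename):

import pyspark.sql.functions as F

desired_df = (
    dataframe_a.join(dataframe_b, ["user_id"], how="full_outer")
    # keep dataframe_a's domain when both sides have one, else fall back to dataframe_b's
    .withColumn("preferred_domain", F.coalesce(dataframe_a.domain, dataframe_b.domain))
    .drop(dataframe_a.domain)
    .drop(dataframe_b.domain)
    .withColumnRenamed("preferred_domain", "domain")
)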

Upvotes: 4

Rohan Gupta

Reputation: 94

No, doing a full_outer join will leave the domain corresponding to ryan as a null value in the resulting dataframe. No type of join operation on the given dataframes will give you the desired output.

Upvotes: 0
