pyspark: union of two dataframes with intersecting rows getting values from the first dataframe

Question

I have two pyspark dataframes A and B, both containing the following schema:

StructField(email_address,StringType,true),StructField(subject_line,StringType,true)))

I want to create a combined dataframe that is union of A and B rows. The intersection of A and B should have the subject lines of A.

So if there's an email address thats present in both the dataframes, my final dataframe will only have a single record for it and the subject line for that record should come from A.

I am struggling with how to do that. I have tried joins and coalesce but no luck so far. Would appreciate some help.

UPDATE: I think I solved it by

df_B_minus_A =df_B.join(df_A, ["email_address"], "leftanti")
df_A.union(df_B_minus_A).count()

Let me know if there's a better way. Thanks

wwnde · Accepted Answer

A

+-------------+------------+
|email_address|subject_line|
+-------------+------------+
|   as@ght.xcv|       kesho|
|  as@ght.dfrt|         leo|
+-------------+------------+

B

+-------------+------------+
|email_address|subject_line|
+-------------+------------+
|   as@ght.xcv|        Jana|
|  as@ght.dfrt|         leo|
+-------------+------------+

Many Ways;

Method 1, which I would discourage because its expensive is compute the left_anti join of B and A and then union outcome with A

A.unionByName(B.join(A, how='left_anti', on=['email_address','subject_line'])).show()

Method 2, UnionByName and drop duplicates

A.unionByName(B).dropDuplicates().show()

Method 3, use spark.sql whose union automatically drop duplicates of the second dataframe

A.createOrReplaceTempView('A')
B.createOrReplaceTempView('B')

spark.sql("""Select * FROM A
             union
             Select * FROM B""").show()

Outcome

+-------------+------------+
|email_address|subject_line|
+-------------+------------+
|   as@ght.xcv|       kesho|
|  as@ght.dfrt|         leo|
|   as@ght.xcv|        Jana|
+-------------+------------+

pyspark: union of two dataframes with intersecting rows getting values from the first dataframe

Answers (1)

Related Questions