Reputation: 1861
I have two pyspark dataframes A and B, both containing the following schema:
StructField(email_address,StringType,true),StructField(subject_line,StringType,true)))
I want to create a combined dataframe that is union of A and B rows. The intersection of A and B should have the subject lines of A.
So if there's an email address thats present in both the dataframes, my final dataframe will only have a single record for it and the subject line for that record should come from A.
I am struggling with how to do that. I have tried joins and coalesce but no luck so far. Would appreciate some help.
UPDATE: I think I solved it by
df_B_minus_A =df_B.join(df_A, ["email_address"], "leftanti")
df_A.union(df_B_minus_A).count()
Let me know if there's a better way. Thanks
Upvotes: 0
Views: 793
Reputation: 26676
A
+-------------+------------+
|email_address|subject_line|
+-------------+------------+
| [email protected]| kesho|
| [email protected]| leo|
+-------------+------------+
B
+-------------+------------+
|email_address|subject_line|
+-------------+------------+
| [email protected]| Jana|
| [email protected]| leo|
+-------------+------------+
Many Ways;
Method 1, which I would discourage because its expensive is compute the left_anti join of B and A and then union outcome with A
A.unionByName(B.join(A, how='left_anti', on=['email_address','subject_line'])).show()
Method 2, UnionByName and drop duplicates
A.unionByName(B).dropDuplicates().show()
Method 3, use spark.sql whose union automatically drop duplicates of the second dataframe
A.createOrReplaceTempView('A')
B.createOrReplaceTempView('B')
spark.sql("""Select * FROM A
union
Select * FROM B""").show()
Outcome
+-------------+------------+
|email_address|subject_line|
+-------------+------------+
| [email protected]| kesho|
| [email protected]| leo|
| [email protected]| Jana|
+-------------+------------+
Upvotes: 1