Reputation: 107
Hi all I have 2 Dataframes and I'm applying some join condition on those dataframes. 1.after join condition i want all the data from first dataframe whose name,id,code,lastname is not matching which second dataframe.I have written below code.
val df3=df1.join(df2,df1("name") !== df2("name_2") &&
df1("id") !== df2("id_2") &&
df1("code") !== df2("code_2") &&
df1("lastname") !== df2("lastname_2"),"inner")
.drop(df2("id_2"))
.drop(df2("name_2"))
.drop(df2("code_2"))
.drop(df2("lastname"))
expected result.
DF1
id,name,code,lastname
1,A,001,p1
2,B,002,p2
3,C,003,p3
DF2
id_2,name_2,code_2,lastname_2
1,A,001,p1
2,B,002,p4
4,D,004,p4
DF3
id,name,code,lastname
3,C,003,p3
Can someone please help me is this the correct way to do this or Should I use sql inner query with 'not In '?. I am new to spark and using first time dataframe methods so I am not sure this is the correct way or not?
Upvotes: 0
Views: 3550
Reputation: 312
I recommend you using Spark API to work with data:
val df1 =
Seq((1, "20181231"), (2, "20190102"), (3, "20190103"), (4, "20190104"), (5, "20190105")).toDF("id", "date")
val df2 =
Seq((1, "20181231"), (2, "20190102"), (4, "20190104"), (5, "20190105")).toDF("id", "date")
Option1. You can get all rows are not included in other dataframe:
val df3=df1.except(df2)
Option2. You can use a specific fields to do anti join, for example 'id':
val df3 = df1.as("table1").join(df2.as("table2"), $"table1.id" === $"table2.id", "leftanti")
df3.show()
Upvotes: 1