Up Ap
Up Ap

Reputation: 107

Spark Data frame Join: Non matching Records from first Dataframe

Hi all I have 2 Dataframes and I'm applying some join condition on those dataframes. 1.after join condition i want all the data from first dataframe whose name,id,code,lastname is not matching which second dataframe.I have written below code.

    val df3=df1.join(df2,df1("name") !==  df2("name_2")  && 
    df1("id") !== df2("id_2") &&
    df1("code") !==  df2("code_2") && 
    df1("lastname") !==  df2("lastname_2"),"inner")
    .drop(df2("id_2"))
    .drop(df2("name_2"))
    .drop(df2("code_2"))
    .drop(df2("lastname"))

expected result.

    DF1
    id,name,code,lastname
    1,A,001,p1
    2,B,002,p2
    3,C,003,p3

    DF2
    id_2,name_2,code_2,lastname_2
    1,A,001,p1
    2,B,002,p4
    4,D,004,p4


    DF3
    id,name,code,lastname
    3,C,003,p3

Can someone please help me is this the correct way to do this or Should I use sql inner query with 'not In '?. I am new to spark and using first time dataframe methods so I am not sure this is the correct way or not?

Upvotes: 0

Views: 3550

Answers (1)

MrElephant
MrElephant

Reputation: 312

I recommend you using Spark API to work with data:

        val df1 =
          Seq((1, "20181231"), (2, "20190102"), (3, "20190103"), (4, "20190104"), (5, "20190105")).toDF("id", "date")

        val df2 =
          Seq((1, "20181231"), (2, "20190102"), (4, "20190104"), (5, "20190105")).toDF("id", "date")

Option1. You can get all rows are not included in other dataframe:

    val df3=df1.except(df2)

Option2. You can use a specific fields to do anti join, for example 'id':

    val df3 = df1.as("table1").join(df2.as("table2"), $"table1.id" === $"table2.id", "leftanti")

    df3.show()

Upvotes: 1

Related Questions