Gravity Boy

Reputation: 7

Will a PySpark inner join keep duplicates from the first dataset in the following example?

So, I have the following two DataFrames:

DF1 looks like the following:

+---------+-----------+
|  Name   |   Money   |
+---------+-----------+
|    A    |   50.3    |
|    B    |   26.9    |
|    C    |   11.4    |
|    A    |   35.8    |
|    B    |   59.2    |
|    A    |   90.8    |
|    C    |   23.5    |
|    D    |   23.5    |
|    D    |   54.6    |
|    E    |   78.0    |
|    A    |   12.3    | 
|    F    |   20.3    |
|    A    |   57.1    | 
+---------+-----------+

DF2 looks like the following (a list of unique names):

+---------+
|  Name   |
+---------+
|    A    |
|    C    |
|    D    |
+---------+

What kind of join will give me the following (keeping only the rows with the names A, C, and D, which appear in DF2):

+---------+-----------+
|  Name   |   Money   |
+---------+-----------+
|    A    |   50.3    |
|    C    |   11.4    |
|    A    |   35.8    |
|    A    |   90.8    |
|    C    |   23.5    |
|    D    |   23.5    |
|    D    |   54.6    |
|    A    |   12.3    | 
|    A    |   57.1    | 
+---------+-----------+

Upvotes: 0

Views: 1040

Answers (3)

OdiumPura

Reputation: 631

A simple inner join should solve your problem. Since you want only the rows whose keys match in both DataFrames, an inner join will keep exactly the matched keys.

from pyspark.sql import functions as f

# Alias both sides so the join condition and the final select can
# refer to each DataFrame's columns unambiguously.
df = df1.alias("t0").join(
    df2.alias("t1"),
    on=f.col("t0.Name") == f.col("t1.Name"),
    how="inner"
).select(
    f.col("t0.Name"),
    f.col("t0.Money")
)
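
If you join on the column name as a string instead, Spark collapses the duplicated join key into a single Name column, so the aliases and the trailing select become unnecessary. A minimal equivalent sketch, assuming the same df1 and df2:

# Passing the join key as a string deduplicates the Name column,
# so no aliases or explicit select are needed.
df = df1.join(df2, on="Name", how="inner")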

Upvotes: 1

Bartosz Gajda

Reputation: 1167

What you need is a LEFT SEMI join, which is really the equivalent of SELECT * FROM df1 WHERE df1.Name IN (SELECT Name FROM df2). In PySpark code, it will look like this:

df1 = spark.createDataFrame([("A", 50.3), ("B", 26.9), ("C", 11.4), ("A", 35.8), ("B", 59.2), ("A", 90.8), ("C", 23.5), ("D", 23.5), ("D", 54.6), ("E", 78.0), ("A", 12.3), ("F", 20.3), ("A", 57.1)], "Name: string, Money: double")
df2 = spark.createDataFrame([("A",), ("C",), ("D",)], "Name: string")

df1.join(df2, "Name", how="left_semi").show()
+----+-----+
|Name|Money|
+----+-----+
|   A| 50.3|
|   A| 35.8|
|   A| 90.8|
|   A| 12.3|
|   A| 57.1|
|   C| 11.4|
|   C| 23.5|
|   D| 23.5|
|   D| 54.6|
+----+-----+

You can read more in the PySpark docs.
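
For completeness, the same semi join can be expressed in Spark SQL, which makes the WHERE ... IN equivalence explicit. A sketch assuming the df1 and df2 created above (the temp view names are arbitrary):

# Register temp views so the DataFrames can be queried with SQL.
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
spark.sql("SELECT * FROM df1 LEFT SEMI JOIN df2 ON df1.Name = df2.Name").show()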

Upvotes: 0

M_S

Reputation: 3753

IMO a simple inner join will be fine here; it is going to filter out the records from df1 whose names do not exist in df2, which is what you expect:

df1.join(df2, df1["Name"] == df2["Name"], "inner")

In the title you asked about duplicates: duplicated records are going to stay there after an inner join. If you want to remove them, you can use distinct (a sketch follows the sample join below).

You can find a sample inner join below (inner is the default join type in Spark, which is why I could skip the third parameter here):

df1.join(df2, df1["Name"] == df2["Name"]).show()
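
And a minimal sketch of the dedup step mentioned above (the joined variable name is just illustrative):

# Inner join as above, then distinct() to drop rows duplicated across all columns.
joined = df1.join(df2, df1["Name"] == df2["Name"], "inner")
joined.distinct().show()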

Upvotes: 1
