Reputation: 7
So, I have the following two dataframes:
DF1 looks like the following:
+---------+-----------+
| Name | Money |
+---------+-----------+
| A | 50.3 |
| B | 26.9 |
| C | 11.4 |
| A | 35.8 |
| B | 59.2 |
| A | 90.8 |
| C | 23.5 |
| D | 23.5 |
| D | 54.6 |
| E | 78.0 |
| A | 12.3 |
| F | 20.3 |
| A | 57.1 |
+---------+-----------+
DF2 looks like the following (a list of unique names):
+---------+
| Name |
+---------+
| A |
| C |
| D |
+---------+
What kind of join will give me the following (keeping only A, C and D, the names that appear in DF2):
+---------+-----------+
| Name | Money |
+---------+-----------+
| A | 50.3 |
| C | 11.4 |
| A | 35.8 |
| A | 90.8 |
| C | 23.5 |
| D | 23.5 |
| D | 54.6 |
| A | 12.3 |
| A | 57.1 |
+---------+-----------+
Upvotes: 0
Views: 1040
Reputation: 631
A simple inner join should solve your problem. Since you want only the rows whose keys appear in both DataFrames, an inner join will keep only the matching keys.
from pyspark.sql import functions as f

df = df1.alias("t0").join(
    df2.alias("t1"),
    on=f.col("t0.Name") == f.col("t1.Name"),
    how="inner"
).select(
    f.col("t0.Name"),
    f.col("t0.Money")
)
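As a side note, since the join key has the same name in both DataFrames, a minimal alternative sketch (assuming df1 and df2 as in the question) is to join on the column name directly, which avoids the duplicate Name column and the trailing select:

# Joining on the column name keeps a single Name column,
# so no select is needed afterwards.
df = df1.join(df2, on="Name", how="inner")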
Upvotes: 1
Reputation: 1167
What you need is a LEFT SEMI join, which is effectively the equivalent of SELECT * FROM df1 WHERE Name IN (SELECT Name FROM df2). In PySpark code, it looks like this:
df1 = spark.createDataFrame([("A", 50.3), ("B", 26.9), ("C", 11.4), ("A", 35.8), ("B", 59.2), ("A", 90.8), ("C", 23.5), ("D", 23.5), ("D", 54.6), ("E", 78.0), ("A", 12.3), ("F", 20.3), ("A", 57.1)], "Name: string, Money: double")
df2 = spark.createDataFrame([("A",), ("C",), ("D",)], "Name: string")
df1.join(df2, "Name", how="left_semi").show()
+----+-----+
|Name|Money|
+----+-----+
| A| 50.3|
| A| 35.8|
| A| 90.8|
| A| 12.3|
| A| 57.1|
| C| 11.4|
| C| 23.5|
| D| 23.5|
| D| 54.6|
+----+-----+
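For comparison, here is a minimal sketch of the SQL form mentioned above, assuming df1 and df2 are registered as temporary views:

# Register both DataFrames as temp views so they can be queried with SQL.
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

# IN (subquery) returns the same rows as the left semi join above.
spark.sql("SELECT * FROM df1 WHERE Name IN (SELECT Name FROM df2)").show()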
You can read more in the PySpark docs.
Upvotes: 0
Reputation: 3753
IMO a simple inner join will be fine here; it will filter out the records from df1 whose names do not exist in df2, which is what you expect.
df1.join(df2, df1("Name") === df2("Name"), "inner")
In the title you asked about duplicates. Duplicated records will remain after an inner join; if you want to remove them, you can use distinct.
You can find a sample inner join below (inner is the default join type in Spark, which is why the third parameter can be skipped there).
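For example, a minimal PySpark sketch (assuming df1 and df2 as defined in the question; the snippet above is Scala-style):

# Inner join is the default, so the third argument can be omitted.
# Joining on the column name keeps a single Name column.
joined = df1.join(df2, "Name")

# distinct() drops fully duplicated rows if you want to remove them.
joined.distinct().show()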
Upvotes: 1