Reputation: 59
I am trying to flag duplicates on columns b and c that occur across different id_number values within the same group. I tried a window partitioned by group and id_number, but I couldn't handle the situation in group d, where the duplicate rows belong to the same id_number and therefore should not be flagged.
Here is the code to create the Spark DataFrame, followed by the outcome I am looking for.
import pandas as pd

pddf = pd.DataFrame({
    "group": ["a", "a", "a", "a", "b", "b", "b", "c", "c", "d", "d", "d"],
    "id_number": ["12", "12", "13", "13", "16", "16", "17", "20", "21", "22", "22", "23"],
    "b": [1000, 2000, 1000, 3000, 1100, 1300, 1100, 1000, 1100, 2000, 2000, 2100],
    "c": ["F", "D", "F", "A", "B", "C", "B", "B", "B", "A", "A", "B"],
})

df = spark.createDataFrame(pddf)
df.show()
+-----+---------+----+---+
|group|id_number|   b|  c|
+-----+---------+----+---+
|    a|       12|1000|  F|
|    a|       12|2000|  D|
|    a|       13|1000|  F|
|    a|       13|3000|  A|
|    b|       16|1100|  B|
|    b|       16|1300|  C|
|    b|       17|1100|  B|
|    c|       20|1000|  B|
|    c|       21|1100|  B|
|    d|       22|2000|  A|
|    d|       22|2000|  A|
|    d|       23|2100|  B|
+-----+---------+----+---+
Desired outcome:
+-----+---------+----+---+-----+
|group|id_number|   b|  c| dups|
+-----+---------+----+---+-----+
|    a|       12|1000|  F|    1|
|    a|       12|2000|  D|    0|
|    a|       13|1000|  F|    1|
|    a|       13|3000|  A|    0|
|    b|       16|1100|  B|    1|
|    b|       16|1300|  C|    0|
|    b|       17|1100|  B|    1|
|    c|       20|1000|  B|    0|
|    c|       21|1100|  B|    0|
|    d|       22|2000|  A|    0|
|    d|       22|2000|  A|    0|
|    d|       23|2100|  B|    0|
+-----+---------+----+---+-----+
Upvotes: 0
Views: 76
Reputation: 1857
I think you're looking for something like this:
import pyspark.sql.functions as f
from pyspark.sql import Window

df = (
    df
    # number of other rows sharing the same (group, b, c) combination
    .withColumn('dups', f.count(f.col('id_number')).over(Window.partitionBy('group', 'b', 'c')) - 1)
)
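Note that counting rows this way still gives dups = 1 for the two identical rows of id_number 22 in group d, since they both fall in the same (group, b, c) partition. If duplicates within the same id_number should not count, as in the desired output, one option is to count distinct id_number values per partition instead, for example with collect_set. This is a sketch of that variant, not part of the original answer:
import pyspark.sql.functions as f
from pyspark.sql import Window

w = Window.partitionBy('group', 'b', 'c')

df = (
    df
    # distinct id_numbers sharing this (group, b, c), excluding the row's own id_number
    .withColumn('dups', f.size(f.collect_set('id_number').over(w)) - 1)
)
With this version, group d's repeated rows yield a set of one id_number, so dups is 0, while group a and b still get 1 where two different id_numbers share the same b and c.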
Upvotes: 1