user10969675

Reputation: 59

Create a column to identify duplicates on certain columns within a PySpark window

I am trying to identify duplicates on columns b and c that occur for different id_number values within the same group. I tried a window partitioned by group and id_number, but I couldn't get past the situation in group d, where the duplicate that occurs for the same id_number should be ignored.

Here is the code to create the spark df and the outcome I am looking for.

import pandas as pd
pddf = pd.DataFrame({"group": ["a", "a", "a", "a", "b", "b", "b", 'c', 'c', 'd', 'd', 'd'], 
                    "id_number": ["12", "12", "13", "13", "16", "16", "17", '20', '21', '22', '22', '23'], 
                    "b": [1000, 2000, 1000, 3000, 1100, 1300, 1100, 1000, 1100, 2000, 2000, 2100],
                    "c": ['F', 'D', 'F', 'A', 'B','C','B', 'B', 'B', 'A', 'A', 'B']
                    })
df = spark.createDataFrame(pddf)
df.show()

+-----+---------+----+---+
|group|id_number|   b|  c|
+-----+---------+----+---+
|    a|       12|1000|  F|
|    a|       12|2000|  D|
|    a|       13|1000|  F|
|    a|       13|3000|  A|
|    b|       16|1100|  B|
|    b|       16|1300|  C|
|    b|       17|1100|  B|
|    c|       20|1000|  B|
|    c|       21|1100|  B|
|    d|       22|2000|  A|
|    d|       22|2000|  A|
|    d|       23|2100|  B|
+-----+---------+----+---+

Desired outcome:

+-----+---------+----+---+-----+
|group|id_number|   b|  c| dups|
+-----+---------+----+---+-----+
|    a|       12|1000|  F|    1|
|    a|       12|2000|  D|    0|
|    a|       13|1000|  F|    1|
|    a|       13|3000|  A|    0|
|    b|       16|1100|  B|    1|
|    b|       16|1300|  C|    0|
|    b|       17|1100|  B|    1|
|    c|       20|1000|  B|    0|
|    c|       21|1100|  B|    0|
|    d|       22|2000|  A|    0|
|    d|       22|2000|  A|    0|
|    d|       23|2100|  B|    0|
+-----+---------+----+---+-----+

Upvotes: 0

Views: 76

Answers (1)

ARCrow

Reputation: 1857

I think you're looking for something like this:

import pyspark.sql.functions as f
from pyspark.sql.window import Window

df = (
    df
    # count every row sharing the same (group, b, c) and subtract the row itself
    .withColumn('dups', f.count(f.col('id_number')).over(Window.partitionBy('group', 'b', 'c')) - 1)
)
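
One thing to watch: a plain count over the partition also counts the two identical rows that id_number 22 has in group d, so both of those rows would get dups = 1 instead of the 0 shown in the desired outcome. If duplicates should only be flagged when they come from different id_number values, counting distinct id_numbers per (group, b, c) partition is one option. A minimal sketch, reusing the df created above:

import pyspark.sql.functions as f
from pyspark.sql.window import Window

w = Window.partitionBy('group', 'b', 'c')

df = (
    df
    # collect the distinct id_numbers sharing this (group, b, c) combination;
    # a value seen under only one id_number gives size 1, hence dups = 0
    .withColumn('dups', f.size(f.collect_set('id_number').over(w)) - 1)
)
df.show()

With this variant, the repeated rows for id_number 22 in group d keep dups = 0, while the matches across different id_numbers in groups a and b are still flagged with 1.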

Upvotes: 1
