user10969675

Reputation: 59

Create a column to identify duplicates on certain columns within a PySpark window

I am trying to identify duplicates on columns b and c that occur for different id_number values within the same group. I tried a window partitioned by group and id_number, but I couldn't get past the situation in group d, where the duplicate that occurs for the same id_number should be ignored.

Here is the code to create the spark df and the outcome I am looking for.

import pandas as pd
pddf = pd.DataFrame({"group": ["a", "a", "a", "a", "b", "b", "b", 'c', 'c', 'd', 'd', 'd'], 
                    "id_number": ["12", "12", "13", "13", "16", "16", "17", '20', '21', '22', '22', '23'], 
                    "b": [1000, 2000, 1000, 3000, 1100, 1300, 1100, 1000, 1100, 2000, 2000, 2100],
                    "c": ['F', 'D', 'F', 'A', 'B','C','B', 'B', 'B', 'A', 'A', 'B']
                    })
df = spark.createDataFrame(pddf)
df.show()

+-----+---------+----+---+
|group|id_number|   b|  c|
+-----+---------+----+---+
|    a|       12|1000|  F|
|    a|       12|2000|  D|
|    a|       13|1000|  F|
|    a|       13|3000|  A|
|    b|       16|1100|  B|
|    b|       16|1300|  C|
|    b|       17|1100|  B|
|    c|       20|1000|  B|
|    c|       21|1100|  B|
|    d|       22|2000|  A|
|    d|       22|2000|  A|
|    d|       23|2100|  B|
+-----+---------+----+---+

Desired outcome:

+-----+---------+----+---+-----+
|group|id_number|   b|  c| dups|
+-----+---------+----+---+-----+
|    a|       12|1000|  F|    1|
|    a|       12|2000|  D|    0|
|    a|       13|1000|  F|    1|
|    a|       13|3000|  A|    0|
|    b|       16|1100|  B|    1|
|    b|       16|1300|  C|    0|
|    b|       17|1100|  B|    1|
|    c|       20|1000|  B|    0|
|    c|       21|1100|  B|    0|
|    d|       22|2000|  A|    0|
|    d|       22|2000|  A|    0|
|    d|       23|2100|  B|    0|
+-----+---------+----+---+-----+

Upvotes: 0

Views: 76

Answers (1)

ARCrow

Reputation: 1857

I think you're looking for something like this:

import pyspark.sql.functions as f
from pyspark.sql.window import Window

df = (
    df
    # count every row sharing the same (group, b, c) and subtract the row itself
    .withColumn('dups', f.count(f.col('id_number')).over(Window.partitionBy('group', 'b', 'c')) - 1)
)
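
One thing to watch: a plain count over the partition also counts the two identical rows that id_number 22 has in group d, so both of those rows would get dups = 1 instead of the 0 shown in the desired outcome. If duplicates should only be flagged when they come from different id_number values, counting distinct id_numbers per (group, b, c) partition is one option. A minimal sketch, reusing the df created above:

import pyspark.sql.functions as f
from pyspark.sql.window import Window

w = Window.partitionBy('group', 'b', 'c')

df = (
    df
    # collect the distinct id_numbers sharing this (group, b, c) combination;
    # a value seen under only one id_number gives size 1, hence dups = 0
    .withColumn('dups', f.size(f.collect_set('id_number').over(w)) - 1)
)
df.show()

With this variant, the repeated rows for id_number 22 in group d keep dups = 0, while the matches across different id_numbers in groups a and b are still flagged with 1.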

Upvotes: 1
