Avinash Clinton
Avinash Clinton

Reputation: 543

adding new column to pandas dataframe to indicate duplicates in specific columns

I'm a newbie to pandas and I'm having issues with writing a function to add a new column based on custom conditions: Following is my dataframe :

    c1      c2     c3           c4              c5
0   1234    888    36.12733265  -115.1710473    7048929337
1   2341    70     33.62503113  -111.928576     7048929337
2   8910    419    40.734631    -73.8700321     9192939495
3   8910    910    40.734631    -73.8700321     9192939495
4   5678    1295   40.719729    -73.84412       5109400188
5   3345    4976   33.5350596   -112.2670918    9192939495
6   233345  2364   33.5350596   -112.2670918    4806391796
7   3010    1155   42.8254528   -71.5012724     2393900772
8   3010    6800   41.0488534   -75.313324      8434975913
9   4534    1791   42.955875    -76.92238325    9048190206
10  7658    4711   40.7635948   -73.3066489     6312542029
11  7658    9120   34.8465348   -117.0854289    6312542029

I want to add new column called dup, which contains flags to indicate whether that row is duplicate(duplicate in terms of specific columns) or not. Priority is in the following order :

1.) if row is duplicate in terms of both c3 and c4, flag should be dup_c3c4

2.)else if row is duplicate in terms of c5, flag should be dup_c5

3.) else if row is duplicate in terms of c1 , then flag should be dup_c1

4.)else flag should be NaD(Not a Duplicate).

Expected output :

    c1      c2      c3            c4            c5          DUP
0   1234    888     36.12733265  -115.1710473   7048929337  dup_c5
1   2341    70      33.62503113  -111.928576    7048929337  dup_c5
2   8910    419     40.734631    -73.8700321    9192939495  dup_c4c5
3   8910    910     40.734631    -73.8700321    9192939495  dup_c4c5
4   5678    1295    40.719729    -73.84412      5109400188  NaD
5   3345    4976    33.5350596   -112.2670918   9192939495  dup_c4c5
6   233345  2364    33.5350596   -112.2670918   4806391796  dup_c4c5
7   3010    1155    42.8254528   -71.5012724    2393900772  dup_c1
8   3010    6800    41.0488534   -75.313324     8434975913  dup_c1
9   4534    1791    42.955875    -76.92238325   9048190206  NaD
10  7658    4711    40.7635948   -73.3066489    6312542029  dup_c5
11  7658    9120    34.8465348   -117.0854289   6312542029  dup_c5

Can any one suggest me how to write a custom function for this scenario using if else or in any other efficient manner.

Upvotes: 1

Views: 3131

Answers (1)

jezrael
jezrael

Reputation: 863801

Use numpy.select with duplicated for 3 different conditions:

m1 = df.duplicated(['c3','c4'], keep=False)
m2 = df.duplicated(['c5'], keep=False)
m3 = df.duplicated(['c1'], keep=False)

df['DUP'] = np.select([m1,m2,m3],['dup_c3c4','dup_c5','dup_c1'], default='NaD')
print (df)
        c1    c2         c3          c4          c5       DUP
0     1234   888  36.127333 -115.171047  7048929337    dup_c5
1     2341    70  33.625031 -111.928576  7048929337    dup_c5
2     8910   419  40.734631  -73.870032  9192939495  dup_c3c4
3     8910   910  40.734631  -73.870032  9192939495  dup_c3c4
4     5678  1295  40.719729  -73.844120  5109400188       NaD
5     3345  4976  33.535060 -112.267092  9192939495  dup_c3c4
6   233345  2364  33.535060 -112.267092  4806391796  dup_c3c4
7     3010  1155  42.825453  -71.501272  2393900772    dup_c1
8     3010  6800  41.048853  -75.313324  8434975913    dup_c1
9     4534  1791  42.955875  -76.922383  9048190206       NaD
10    7658  4711  40.763595  -73.306649  6312542029    dup_c5
11    7658  9120  34.846535 -117.085429  6312542029    dup_c5

If need function:

def f(df):
    m1 = df.duplicated(['c3','c4'], keep=False)
    m2 = df.duplicated(['c5'], keep=False)
    m3 = df.duplicated(['c1'], keep=False)

    df['DUP'] = np.select([m1,m2,m3],['dup_c3c4','dup_c5','dup_c1'], default='NaD')
    return df

df1 = f(df)

Upvotes: 4

Related Questions