Reputation: 543
I'm a newbie to pandas and I'm having issues with writing a function to add a new column based on custom conditions: Following is my dataframe :
c1 c2 c3 c4 c5
0 1234 888 36.12733265 -115.1710473 7048929337
1 2341 70 33.62503113 -111.928576 7048929337
2 8910 419 40.734631 -73.8700321 9192939495
3 8910 910 40.734631 -73.8700321 9192939495
4 5678 1295 40.719729 -73.84412 5109400188
5 3345 4976 33.5350596 -112.2670918 9192939495
6 233345 2364 33.5350596 -112.2670918 4806391796
7 3010 1155 42.8254528 -71.5012724 2393900772
8 3010 6800 41.0488534 -75.313324 8434975913
9 4534 1791 42.955875 -76.92238325 9048190206
10 7658 4711 40.7635948 -73.3066489 6312542029
11 7658 9120 34.8465348 -117.0854289 6312542029
I want to add new column called dup, which contains flags to indicate whether that row is duplicate(duplicate in terms of specific columns) or not. Priority is in the following order :
1.) if row is duplicate in terms of both c3 and c4, flag should be dup_c3c4
2.)else if row is duplicate in terms of c5, flag should be dup_c5
3.) else if row is duplicate in terms of c1 , then flag should be dup_c1
4.)else flag should be NaD(Not a Duplicate).
Expected output :
c1 c2 c3 c4 c5 DUP
0 1234 888 36.12733265 -115.1710473 7048929337 dup_c5
1 2341 70 33.62503113 -111.928576 7048929337 dup_c5
2 8910 419 40.734631 -73.8700321 9192939495 dup_c4c5
3 8910 910 40.734631 -73.8700321 9192939495 dup_c4c5
4 5678 1295 40.719729 -73.84412 5109400188 NaD
5 3345 4976 33.5350596 -112.2670918 9192939495 dup_c4c5
6 233345 2364 33.5350596 -112.2670918 4806391796 dup_c4c5
7 3010 1155 42.8254528 -71.5012724 2393900772 dup_c1
8 3010 6800 41.0488534 -75.313324 8434975913 dup_c1
9 4534 1791 42.955875 -76.92238325 9048190206 NaD
10 7658 4711 40.7635948 -73.3066489 6312542029 dup_c5
11 7658 9120 34.8465348 -117.0854289 6312542029 dup_c5
Can any one suggest me how to write a custom function for this scenario using if else or in any other efficient manner.
Upvotes: 1
Views: 3131
Reputation: 863801
Use numpy.select
with duplicated
for 3 different conditions:
m1 = df.duplicated(['c3','c4'], keep=False)
m2 = df.duplicated(['c5'], keep=False)
m3 = df.duplicated(['c1'], keep=False)
df['DUP'] = np.select([m1,m2,m3],['dup_c3c4','dup_c5','dup_c1'], default='NaD')
print (df)
c1 c2 c3 c4 c5 DUP
0 1234 888 36.127333 -115.171047 7048929337 dup_c5
1 2341 70 33.625031 -111.928576 7048929337 dup_c5
2 8910 419 40.734631 -73.870032 9192939495 dup_c3c4
3 8910 910 40.734631 -73.870032 9192939495 dup_c3c4
4 5678 1295 40.719729 -73.844120 5109400188 NaD
5 3345 4976 33.535060 -112.267092 9192939495 dup_c3c4
6 233345 2364 33.535060 -112.267092 4806391796 dup_c3c4
7 3010 1155 42.825453 -71.501272 2393900772 dup_c1
8 3010 6800 41.048853 -75.313324 8434975913 dup_c1
9 4534 1791 42.955875 -76.922383 9048190206 NaD
10 7658 4711 40.763595 -73.306649 6312542029 dup_c5
11 7658 9120 34.846535 -117.085429 6312542029 dup_c5
If need function:
def f(df):
m1 = df.duplicated(['c3','c4'], keep=False)
m2 = df.duplicated(['c5'], keep=False)
m3 = df.duplicated(['c1'], keep=False)
df['DUP'] = np.select([m1,m2,m3],['dup_c3c4','dup_c5','dup_c1'], default='NaD')
return df
df1 = f(df)
Upvotes: 4