Reputation: 325
I am trying to remove duplicated based on multiple criteria:
Find duplicated in column df['A']
Check column df['status']
and prioritize OK vs Open and Open vs Close
if we have a duplicate with same status pick the lates one based on df['Col_1]
df = pd.DataFrame({'A' : ['11', '11', '12', np.nan, '13', '13', '14', '14', '15'], 'Status' : ['OK','Close','Close','OK','OK','Open','Open','Open',np.nan], 'Col_1' :[2000, 2001, 2000, 2000, 2000, 2002, 2000, 2004, 2000]}) df
Expected output:
I have tried differente solutions like the links below (map or loc) but I am unable to find the correct way:
Pandas : remove SOME duplicate values based on conditions
Upvotes: 1
Views: 964
Reputation: 862641
Create ordered categorical for prioritize Status
, then sorting per all columns, remove duplicates by first column A
and last sorting index:
c = ['OK','Open','Close']
df['Status'] = pd.Categorical(df['Status'], ordered=True, categories=c)
df = df.sort_values(['A','Status','Col_1']).drop_duplicates('A').sort_index()
print (df)
A Status Col_1
0 11 OK 2000
2 12 Close 2000
3 NaN OK 2000
4 13 OK 2000
6 14 Open 2000
8 15 NaN 2000
EDIT If need avoid NaN
s are removed add helper column:
df['test'] = df['A'].isna().cumsum()
c = ['OK','Open','Close']
df['Status'] = pd.Categorical(df['Status'], ordered=True, categories=c)
df = (df.sort_values(['A','Status','Col_1', 'test'])
.drop_duplicates(['A', 'test'])
.sort_index())
Upvotes: 1