Caiotru
Caiotru

Reputation: 325

Pandas / Python remove duplicates based on specific row values

I am trying to remove duplicated based on multiple criteria:

  1. Find duplicated in column df['A']

  2. Check column df['status'] and prioritize OK vs Open and Open vs Close

  3. if we have a duplicate with same status pick the lates one based on df['Col_1]

df = pd.DataFrame({'A' : ['11', '11', '12', np.nan, '13', '13', '14', '14', '15'], 'Status' : ['OK','Close','Close','OK','OK','Open','Open','Open',np.nan], 'Col_1' :[2000, 2001, 2000, 2000, 2000, 2002, 2000, 2004, 2000]}) df

Expected output:

enter image description here

I have tried differente solutions like the links below (map or loc) but I am unable to find the correct way:

Pandas : remove SOME duplicate values based on conditions

Upvotes: 1

Views: 964

Answers (1)

jezrael
jezrael

Reputation: 862641

Create ordered categorical for prioritize Status, then sorting per all columns, remove duplicates by first column A and last sorting index:

c = ['OK','Open','Close']
df['Status'] = pd.Categorical(df['Status'], ordered=True, categories=c)

df = df.sort_values(['A','Status','Col_1']).drop_duplicates('A').sort_index()
print (df)
     A Status  Col_1
0   11     OK   2000
2   12  Close   2000
3  NaN     OK   2000
4   13     OK   2000
6   14   Open   2000
8   15    NaN   2000

EDIT If need avoid NaNs are removed add helper column:

df['test'] = df['A'].isna().cumsum()

c = ['OK','Open','Close']
df['Status'] = pd.Categorical(df['Status'], ordered=True, categories=c)

df = (df.sort_values(['A','Status','Col_1', 'test'])
        .drop_duplicates(['A', 'test'])
        .sort_index())

Upvotes: 1

Related Questions