Removing duplicate rows in a dataframe with some conditions on data in a particular column

Question

I have the following dataframe, df:

Index   time   block   cell
 0       9      25      c1
 1       9      25      c1
 2       33     35      c2
 3       47     4       c1
 4       47     17      c2
 5       100    21      c1
 6       120    21      c1
 7       120    36      c2

Duplicates should be dropped based on the time column, subject to a condition:
- If two or more rows with the same time have the same cell (for example, index 0 and index 1 both have c1), keep any one of those rows.
- If two or more rows with the same time have different cells (for example, index 3 and 4, or index 6 and 7), keep all of the rows for that time.

The resulting dataframe, df_result, should be:

Index   time   block   cell
 0       9      25      c1
 2       33     35      c2
 3       47     4       c1
 4       47     17      c2
 5       100    21      c1
 6       120    21      c1
 7       120    36      c2

I tried df.drop_duplicates('time'), but that keeps only one row per time value, so it also drops rows whose cells differ.

Aryan Jain · Accepted Answer

You can achieve this by splitting the original DataFrame by its cell value and then running drop_duplicates() within each group.

import pandas as pd

df = pd.DataFrame({
    'time': [9, 9, 33, 47, 47, 100, 120, 120],
    'block': [25, 25, 35, 4, 17, 21, 21, 36],
    'cell': 'c1;c1;c2;c1;c2;c1;c1;c2'.split(';'),
})

# Drop exact duplicate rows within each cell value, collecting one piece per cell.
categories = df['cell'].unique()
parts = [df[df['cell'] == category].drop_duplicates(keep='first') for category in categories]

# Combine the pieces and restore the original row order.
df2 = pd.concat(parts).sort_index()

This will result in df2 being

    time  block cell
0     9     25   c1
2    33     35   c2
3    47      4   c1
4    47     17   c2
5   100     21   c1
6   120     21   c1
7   120     36   c2
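
As a possible alternative, the same result can probably be obtained without a loop by passing both columns to the subset parameter of drop_duplicates(). This is only a sketch under one assumption: that two rows should count as duplicates whenever their time and cell match, regardless of block. The loop above compares whole rows, so the two approaches would differ if rows shared a time and cell but had different block values. The name df_result is taken from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'time': [9, 9, 33, 47, 47, 100, 120, 120],
    'block': [25, 25, 35, 4, 17, 21, 21, 36],
    'cell': ['c1', 'c1', 'c2', 'c1', 'c2', 'c1', 'c1', 'c2'],
})

# Assumption: rows are duplicates only when both time and cell match;
# block is ignored when deciding what to drop.
df_result = df.drop_duplicates(subset=['time', 'cell'], keep='first')
print(df_result)
```

On the sample data this keeps index 0, 2, 3, 4, 5, 6 and 7, and the original row order is preserved without a separate sort.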
