Learner

Reputation: 672

remove duplicate rows based on specific criteria with pandas

I have the dataframe below:

   Id   Name    Sales   Rent    Rate
40808   A2      0       43      340
17486   DV      491     0       346
17486   D       0       0       0
27977   AM      0       0       0
27977   A-M     0       0       94
80210   O-9     0       0       0
80210   M-1     0       0       -37
15545   M-2     0       0       -17
15545   O-8     0       0       0
53549   A-M7    0       0       0
53549   A-M8    0       0       50
40808   A       0       0       0
66666   MK      0       0       0
    

I want to remove duplicate rows based on the Id values (e.g. 40808) and keep only the row that does not have a 0 value in all of the fields. I used the suggestion from the answer:

df['zero']=df.select_dtypes(['int','float']).eq(0).sum(axis=1)
df=df.sort_values(['zero','Id']).drop_duplicates(subset=['Id']).drop(columns='zero')

The output I got:

      Id  Name  Sales  Rent     Rate
  40808    A2      0     43      340
  53549  A-M7      0      0        0
  27977    AM      0      0        0
  17486     D      0      0        0
  80210   M-1      0      0       -37
  15545   M-2      0      0       -17
  66666    MK      0      0        0

Expected output:

Id      Name    Sales   Rent    Rate
40808   A2      0       43      340
17486   DV      491     0       346
27977   A-M     0       0       94
80210   M-1     0       0       -37
15545   M-2     0       0       -17
53549   A-M8    0       0       50
66666   MK      0       0       0
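
For reference, a minimal sketch that rebuilds this sample dataframe (the value columns are assumed to be plain integers):

import pandas as pd

# rebuild the sample dataframe from the question
df = pd.DataFrame({
    'Id':    [40808, 17486, 17486, 27977, 27977, 80210, 80210,
              15545, 15545, 53549, 53549, 40808, 66666],
    'Name':  ['A2', 'DV', 'D', 'AM', 'A-M', 'O-9', 'M-1',
              'M-2', 'O-8', 'A-M7', 'A-M8', 'A', 'MK'],
    'Sales': [0, 491, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'Rent':  [43, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'Rate':  [340, 346, 0, 0, 94, 0, -37, -17, 0, 0, 50, 0, 0],
})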

Upvotes: 3

Views: 3873

Answers (5)

KingOtto

Reputation: 1483

This works, proceeding in 2 steps:

# Step 1 - collect all rows that are *not* duplicates (based on ID)
non_duplicates_to_keep = df.drop_duplicates(subset='Id', keep=False)

# Step 2a - identify *all* rows that have duplicates (based on ID, keep all)
sub_df = df[df.duplicated('Id', keep=False)]

# Step 2b - of those duplicates, keep only the rows with at least one non-zero value (Id column excluded)
num_cols = sub_df.select_dtypes('number').columns[1:]   # numeric columns except Id
duplicates_to_keep = sub_df[(sub_df[num_cols] != 0).sum(axis=1) > 0]

# join the 2 sets
pd.concat([non_duplicates_to_keep, duplicates_to_keep])

Beware, though: what you are asking for can again lead to duplicates. If a group has, say, 4 rows of which 2 have non-zero values, you will end up with a duplicate Id after cleaning: 2 rows are removed because they contain only zeros, and 2 remain because they are non-zero.

Not the case in your dummy data, but it happens in general... Is this really what you are after?
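
A quick sketch of that caveat on made-up data (not part of the question):

import pandas as pd

# hypothetical group: four rows sharing one Id, two of them non-zero
demo = pd.DataFrame({'Id':    [1, 1, 1, 1],
                     'Name':  ['a', 'b', 'c', 'd'],
                     'Sales': [0, 0, 5, 7],
                     'Rent':  [0, 0, 0, 0],
                     'Rate':  [0, 0, 0, 0]})

dups = demo[demo.duplicated('Id', keep=False)]                       # step 2a
kept = dups[(dups[['Sales', 'Rent', 'Rate']] != 0).sum(axis=1) > 0]  # step 2b
print(kept)   # rows 'c' and 'd' both survive, so Id 1 is still duplicated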

Upvotes: 0

mozway

Reputation: 260690

Here is a working solution. It first splits the data in two: the rows that we keep for sure, and the whole dataframe in which the "keep-for-sure" rows are flagged with a NaN in Name. We then drop duplicates in this second subset so that an all-zero row is kept only when no "keep-for-sure" row is present for that Id. Finally, we concatenate both subsets after dropping the flagged "keep-for-sure" rows from the second one.

cond = df[['Sales', 'Rent', 'Rate']].ne(0).any(axis=1)   # rows to keep for sure

pd.concat([df[cond],
           (df.assign(Name=df['Name'].where(~cond, float('nan')))   # flag keep-for-sure rows with NaN
              .loc[cond.sort_values().index]   # sort so that keep-for-sure rows come last
              .drop_duplicates(subset='Id', keep='last')   # keep a zeros-only row only if no keep-for-sure row in its group
              .dropna(subset=['Name'])   # drop the flagged keep-for-sure rows (already in the first slice)
            )
          ])

output:

       Id  Name  Sales  Rent  Rate
0   40808    A2      0    43   340
1   17486    DV    491     0   346
4   27977   A-M      0     0    94
6   80210   M-1      0     0   -37
7   15545   M-2      0     0   -17
10  53549  A-M8      0     0    50
12  66666    MK      0     0     0

Upvotes: 1

ThePyGuy

Reputation: 18416

First create a mask to separate duplicate and non-duplicate rows based on Id, then concatenate the duplicate rows whose values are not all equal to 0 with the non-duplicate slice.

>>> duplicateMask = df.duplicated('Id', keep=False)
>>> pd.concat([df.loc[duplicateMask & df[['Sales', 'Rent', 'Rate']].ne(0).any(axis=1)],
               df[~duplicateMask]])
       Id  Name  Sales  Rent  Rate
0   40808    A2      0    43   340
1   17486    DV    491     0   346
4   27977   A-M      0     0    94
6   80210   M-1      0     0   -37
7   15545   M-2      0     0   -17
10  53549  A-M8      0     0    50
12  66666    MK      0     0     0
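
Since pd.concat simply stacks the two slices, the non-duplicate rows end up at the bottom of the result. If the original row order matters, the same expression can be chained with sort_index(), for example:

result = pd.concat([df.loc[duplicateMask & df[['Sales', 'Rent', 'Rate']].ne(0).any(axis=1)],
                    df[~duplicateMask]]).sort_index()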

Upvotes: 4

Corralien

Reputation: 120409

The problem with your sample is that once you have removed the rows with all zeros in the columns [Sales, Rent, Rate], there are no more duplicate values.

I want to remove duplicate rows based on the Id values (e.g. 40808) and keep only the row that does not have a 0 value in all of the fields.

You should reverse the logic:

I want to keep only the row that does not have a 0 value in all of the fields and (then) remove duplicate rows based on the Id values (e.g. 40808).

>>> df[~df[['Sales', 'Rent', 'Rate']].eq(0).all(axis=1)].drop_duplicates('Id')

       Id  Name  Sales  Rent  Rate
0   40808    A2      0    43   340
1   17486    DV    491     0   346
4   27977   A-M      0     0    94
6   80210   M-1      0     0   -37
7   15545   M-2      0     0   -17
10  53549  A-M8      0     0    50

Upvotes: 1

Anurag Dabas

Reputation: 24314

Another way is to count the number of zeros in each row, sort by that count (and by Id), drop the duplicate Id values, and finally remove the helper 'zero' column:

df['zero'] = df.select_dtypes(['int','float']).eq(0).sum(axis=1)   # count zeros per row
df = df.sort_values(['zero','Id']).drop_duplicates(subset=['Id']).drop(columns='zero')
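
One detail worth noting: select_dtypes also picks up the numeric Id column, so Id contributes to the zero count (harmlessly here, since no Id is 0). A variant of the same idea that counts zeros only in the three value columns, assuming those column names, would be:

df['zero'] = df[['Sales','Rent','Rate']].eq(0).sum(axis=1)   # zeros in the value columns only
df = df.sort_values(['zero','Id']).drop_duplicates(subset=['Id']).drop(columns='zero')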

Upvotes: 2
