Reputation: 672
I have the dataframe below:
Id Name Sales Rent Rate
40808 A2 0 43 340
17486 DV 491 0 346
17486 D 0 0 0
27977 AM 0 0 0
27977 A-M 0 0 94
80210 O-9 0 0 0
80210 M-1 0 0 -37
15545 M-2 0 0 -17
15545 O-8 0 0 0
53549 A-M7 0 0 0
53549 A-M8 0 0 50
40808 A 0 0 0
66666 MK 0 0 0
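For reference, the frame can be rebuilt with:
import pandas as pd

# reconstruction of the sample frame shown above
df = pd.DataFrame({
    'Id':    [40808, 17486, 17486, 27977, 27977, 80210, 80210,
              15545, 15545, 53549, 53549, 40808, 66666],
    'Name':  ['A2', 'DV', 'D', 'AM', 'A-M', 'O-9', 'M-1',
              'M-2', 'O-8', 'A-M7', 'A-M8', 'A', 'MK'],
    'Sales': [0, 491, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'Rent':  [43, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'Rate':  [340, 346, 0, 0, 94, 0, -37, -17, 0, 0, 50, 0, 0],
})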
I want to remove duplicate rows based on the Id values (e.g. 40808) and keep only the row that doesn't have 0 in all the fields. I used the suggestion from the answer:
df['zero'] = df.select_dtypes(['int', 'float']).eq(0).sum(axis=1)
df = df.sort_values(['zero', 'Id']).drop_duplicates(subset=['Id']).drop(columns='zero')
The output I got:
Id Name Sales Rent Rate
40808 A2 0 43 340
53549 A-M7 0 0 0
27977 AM 0 0 0
17486 D 0 0 0
80210 M-1 0 0 -37
15545 M-2 0 0 -17
66666 MK 0 0 0
Expected output:
Id Name Sales Rent Rate
40808 A2 0 43 340
17486 DV 491 0 346
27977 A-M 0 0 94
80210 M-1 0 0 -37
15545 M-2 0 0 -17
53549 A-M8 0 0 50
66666 MK 0 0 0
Upvotes: 3
Views: 3873
Reputation: 1483
This works, proceeding in 2 steps:
# Step 1 - collect all rows that are *not* duplicates (based on ID)
non_duplicates_to_keep = df.drop_duplicates(subset='Id', keep=False)
# Step 2a - identify *all* rows that have duplicates (based on ID, keep all)
sub_df = df[df.duplicated('Id', keep=False)]
# Step 2b - of those duplicates, discard the rows that are 0 in *all* of the numeric value columns
value_cols = sub_df.select_dtypes('number').columns[1:]  # numeric columns except Id
duplicates_to_keep = sub_df[(sub_df[value_cols] != 0).sum(axis=1) > 0]
# join the 2 sets
pd.concat([non_duplicates_to_keep, duplicates_to_keep])
Beware, though: what you are asking for can itself produce duplicates again. If an Id has, say, 4 rows of which 2 contain non-zero values, you will end up with a duplicate Id after cleaning: the 2 all-zero rows are removed, but the 2 non-zero rows both remain. That is not the case in your dummy data, but it happens in general... Is this really what you are after?
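To illustrate the caveat with a minimal sketch (hypothetical Id 99999, not part of the question's data):
import pandas as pd

# hypothetical group: 4 rows with the same Id, 2 of them non-zero
dup = pd.DataFrame({'Id': [99999] * 4,
                    'Name': ['X1', 'X2', 'X3', 'X4'],
                    'Sales': [0, 0, 10, 20],
                    'Rent': [0] * 4,
                    'Rate': [0] * 4})
# the same filter as in step 2b keeps both non-zero rows,
# so Id 99999 is still duplicated afterwards
print(dup[(dup[['Sales', 'Rent', 'Rate']] != 0).sum(axis=1) > 0])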
Upvotes: 0
Reputation: 260690
Here is a working solution. It first splits the data in two: the rows we keep for sure, and the whole dataframe with the "keep-for-sure" rows flagged by a NaN in Name. We then drop duplicates in the latter so that an all-zero row survives only when no "keep-for-sure" row exists for its Id, and finally concatenate both subsets after dropping the flagged "keep-for-sure" rows from the second one.
cond = df[['Sales', 'Rent', 'Rate']].ne(0).any(axis=1)  # rows to keep for sure
pd.concat([df[cond],
           (df.assign(Name=df['Name'].where(~cond, float('nan')))  # flag keep-for-sure rows with NaN
              .loc[cond.sort_values().index]                       # sort so that keep-for-sure rows come last
              .drop_duplicates(subset='Id', keep='last')           # keep an all-zero row only if its group has no keep-for-sure row
              .dropna(subset=['Name'])                             # drop the flagged keep-for-sure rows
           )
          ])
output:
       Id  Name  Sales  Rent  Rate
0   40808    A2      0    43   340
1   17486    DV    491     0   346
4   27977   A-M      0     0    94
6   80210   M-1      0     0   -37
7   15545   M-2      0     0   -17
10  53549  A-M8      0     0    50
12  66666    MK      0     0     0
Upvotes: 1
Reputation: 18416
First create a mask to separate the duplicate and non-duplicate rows based on Id, then concatenate the non-duplicate slice with the rows of the duplicate slice whose values are not all 0.
>>> duplicateMask = df.duplicated('Id', keep=False)
>>> pd.concat([df.loc[duplicateMask & df[['Sales', 'Rent', 'Rate']].ne(0).any(axis=1)],
df[~duplicateMask]])
Id Name Sales Rent Rate
0 40808 A2 0 43 340
1 17486 DV 491 0 346
4 27977 A-M 0 0 94
6 80210 M-1 0 0 -37
7 15545 M-2 0 0 -17
10 53549 A-M8 0 0 50
12 66666 MK 0 0 0
Upvotes: 4
Reputation: 120409
The problem with your sample is that once you have removed the rows with all zeros in the columns [Sales, Rent, Rate], there are no duplicate values left.
I want to remove duplicate rows based on the Id values (e.g. 40808) and keep only the row that doesn't have 0 in all the fields.
You should reverse the logic:
I want to keep only the row that doesn't have 0 in all the fields and (then) remove duplicate rows based on the Id values (e.g. 40808).
>>> df[~df[['Sales', 'Rent', 'Rate']].eq(0).all(axis=1)].drop_duplicates('Id')
       Id  Name  Sales  Rent  Rate
0   40808    A2      0    43   340
1   17486    DV    491     0   346
4   27977   A-M      0     0    94
6   80210   M-1      0     0   -37
7   15545   M-2      0     0   -17
10  53549  A-M8      0     0    50
As stated above, drop_duplicates has nothing left to remove here. Note that Id 66666 (MK) is filtered out as well, because its only row is all zeros; the answers that treat duplicates separately do keep it.
Upvotes: 1
Reputation: 24314
Another way is to count the number of zeros in each row, sort by that count, drop the duplicate Id values, and finally remove the helper 'zero' column:
df['zero'] = df.select_dtypes(['int', 'float']).eq(0).sum(axis=1)
df = df.sort_values(['zero', 'Id']).drop_duplicates(subset=['Id']).drop(columns='zero')
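Note that select_dtypes(['int', 'float']) also picks up the Id column; since none of the Ids is 0 the count is unaffected here, but listing the value columns explicitly is safer (a variant, assuming Sales, Rent and Rate are the only columns to check):
# count zeros only in the value columns, not in Id
df['zero'] = df[['Sales', 'Rent', 'Rate']].eq(0).sum(axis=1)
df = df.sort_values(['zero', 'Id']).drop_duplicates(subset=['Id']).drop(columns='zero')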
Upvotes: 2