Raven

Reputation: 859

Remove Duplicate Rows by ID if Both Meet a Condition

I would like to remove records from a database by ID if they meet the following condition: there is a duplicate of a record, and the "second" record's transaction amount is the negative inverse of the first record's amount. In the example below, I would like to remove IDs 2 and 4.

Here is my sample database

ID  transaction 
1        203
2        506
2       -506
3        700
4        234
4       -234
5        120
5        230

Desired output

ID  transaction 
1       203
3       700
5       120
5       230

Upvotes: 0

Views: 167

Answers (3)

Rajat Mishra

Reputation: 3770

You can use drop_duplicates to drop the rows having the same ID.

In [58]: df = pd.DataFrame({'ID':[1,2,2,3,4,4], 
    ...: 'transaction':['$203','$506','$(506)','$700','$234','$(234)']})        

In [59]: df                                                                     
Out[59]: 
   ID transaction
0   1        $203
1   2        $506
2   2      $(506)
3   3        $700
4   4        $234
5   4      $(234)

In [60]: df.drop_duplicates(subset=['ID'],keep=False,inplace=True)              

In [61]: df                                                                     
Out[61]: 
   ID transaction
0   1        $203
3   3        $700

You can read more about drop_duplicates in the pandas documentation.

Update: Based on the data update by the OP, a column can be created holding the absolute value of the transaction, and drop_duplicates can then drop rows having the same values in ID and the new column.

In [68]: df['abs_transactions'] = df.transaction.abs()                          

In [69]: df                                                                     
Out[69]: 
   ID  transaction  abs_transactions
0   1          203               203
1   2          506               506
2   2         -506               506
3   3          700               700
4   4          234               234
5   4         -234               234
6   5          120               120
7   5          230               230

In [71]: df.drop_duplicates(subset=['ID','abs_transactions'],keep=False,inplace=
    ...: True)                                                                  

In [72]: df                                                                     
Out[72]: 
   ID  transaction  abs_transactions
0   1          203               203
3   3          700               700
6   5          120               120
7   5          230               230

In [74]: df.drop('abs_transactions',inplace=True,axis=1)                        

In [75]: df                                                                     
Out[75]: 
   ID  transaction
0   1          203
3   3          700
6   5          120
7   5          230
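If you'd rather not mutate df or keep the helper column around, the same logic can be written as one chained expression that builds a boolean mask and filters with it. A sketch using the OP's numeric sample (`abs_t` is just an illustrative column name):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 2, 3, 4, 4, 5, 5],
                   'transaction': [203, 506, -506, 700, 234, -234, 120, 230]})

# Mark every row whose (ID, |transaction|) pair occurs more than once,
# then keep only the unmarked rows; df itself is left untouched.
mask = df.assign(abs_t=df.transaction.abs()).duplicated(
    subset=['ID', 'abs_t'], keep=False)
result = df[~mask]
print(result)
```

This keeps the original signed transaction values and avoids the follow-up drop of the helper column.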

Upvotes: 1

sophocles

Reputation: 13821

If your dataframe includes only those columns, you can simply take the absolute value of the dataframe and then drop_duplicates() in a single line:

df_new = abs(df).drop_duplicates(keep=False)

which prints:

print(df_new)

Out[23]: 
   ID  transaction
0   1          203
3   3          700
6   5          120
7   5          230
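One caveat: abs(df) also rewrites the values in the surviving rows, which only goes unnoticed here because every kept transaction happens to be positive. If an unmatched negative amount could survive the filter, a mask-based variant of the same idea preserves the original signs. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 2, 3, 4, 4, 5, 5],
                   'transaction': [203, 506, -506, 700, 234, -234, 120, 230]})

# abs(df) is only used to build the duplicate mask, so the rows that
# survive keep their original (possibly negative) transaction values.
df_new = df[~abs(df).duplicated(keep=False)]
print(df_new)
```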

Upvotes: 1

Matthew Cox

Reputation: 1194

I believe that you can accomplish this by summing the transactions by ID and selecting all IDs whose transactions sum to > 0.

>>> df = pd.DataFrame.from_records([(1,203),(2,506),(2,-506),(3,700),(4,234),(4,-234)], columns=['ID', 'transaction'])
>>> transaction_totals = df.groupby('ID').transaction.sum()
>>> transaction_totals.loc[transaction_totals > 0].reset_index()
   ID  transaction
0   1          203
1   3          700
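One caveat with summing per ID: the result collapses each ID to a single total, so an ID with several distinct transactions (like 5 in the updated sample) loses its individual rows. A transform-based sketch applies the same sum condition while keeping the original rows (it still assumes that amounts within a matching ID cancel exactly and that surviving totals are positive):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 2, 3, 4, 4, 5, 5],
                   'transaction': [203, 506, -506, 700, 234, -234, 120, 230]})

# transform('sum') broadcasts each ID's total back onto its own rows,
# so we can filter on the total without losing the per-row amounts.
kept = df[df.groupby('ID').transaction.transform('sum') > 0]
print(kept)
```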

Upvotes: 2
