Reputation: 859
I would like to remove records from a database by ID if they meet the following condition of if there a duplicate of a record and the "second" record shows the negative inverse of the first record's transaction amount. In the example below, I would like to remove IDs 2 and 4.
Here is my sample database
ID transaction
1 203
2 506
2 -506
3 700
4 234
4 -234
5 120
5 230
Desired output
ID transaction
1 203
3 700
5 120
5 230
Upvotes: 0
Views: 167
Reputation: 3770
you can use drop_duplicated
to drop the rows having same ID
.
In [58]: df = pd.DataFrame({'ID':[1,2,2,3,4,4],
...: 'transaction':['$203','$506','$(506)','$700','$234','$(234)']})
In [59]: df
Out[59]:
ID transaction
0 1 $203
1 2 $506
2 2 $(506)
3 3 $700
4 4 $234
5 4 $(234)
In [60]: df.drop_duplicates(subset=['ID'],keep=False,inplace=True)
In [61]: df
Out[61]:
ID transaction
0 1 $203
3 3 $700
you can read more about the drop_duplicate here.
Update Based on the data update by OP, a column can be created having the absolute value of the transaction and use drop_dduplicate to drop rows having same values in ID and new column.
In [68]: df['abs_transactions'] = df.transaction.abs()
In [69]: df
Out[69]:
ID transaction abs_transactions
0 1 203 203
1 2 506 506
2 2 -506 506
3 3 700 700
4 4 234 234
5 4 -234 234
6 5 120 120
7 5 230 230
In [71]: df.drop_duplicates(subset=['ID','abs_transactions'],keep=False,inplace=
...: True)
In [72]: df
Out[72]:
ID transaction abs_transactions
0 1 203 203
3 3 700 700
6 5 120 120
7 5 230 230
In [74]: df.drop('abs_transactions',inplace=True,axis=1)
In [75]: df
Out[75]:
ID transaction
0 1 203
3 3 700
6 5 120
7 5 230
Upvotes: 1
Reputation: 13821
If you dataframe includes only those columns, you can simply convert the dataframe to abs
and then drop_duplicates()
in a single line:
df_new = abs(df).drop_duplicates(keep=False)
which prints:
print(df_new)
Out[23]:
ID transaction
0 1 203
3 3 700
6 5 120
7 5 230
Upvotes: 1
Reputation: 1194
I believe that you can accomplish this by summing by the ID and selecting all transactions that sum to > 0.
>>> df = pd.DataFrame.from_records([(1,203),(2,506),(2,-506),(3,700),(4,234),(4,-234)], columns=['ID', 'transaction'])
>>> transaction_totals = df.groupby('ID').transaction.sum()
>>> transaction_totals.loc[transaction_totals > 0].reset_index()
ID transaction
0 1 203
1 3 700
Upvotes: 2