Reputation: 83
I have a pandas dataframe of the following structure:
import pandas as pd

df = pd.DataFrame({'ID':  ['A001', 'A001', 'A001', 'A002', 'A002', 'A003', 'A003', 'A004', 'A004', 'A004', 'A005', 'A005'],
                   'Val1': [2, 2, 2, 5, 6, 8, 8, 3, 3, 3, 7, 7],
                   'Val2': [100, -100, 50, -40, 40, 60, -50, 10, -10, 10, 15, 15]})
ID Val1 Val2
0 A001 2 100
1 A001 2 -100
2 A001 2 50
3 A002 5 -40
4 A002 6 40
5 A003 8 60
6 A003 8 -50
7 A004 3 10
8 A004 3 -10
9 A004 3 10
10 A005 7 15
11 A005 7 15
I want to remove duplicate rows where ID and Val1 are duplicated and where Val2 sums to zero across two rows. The positive/negative Val2 rows are not necessarily consecutive, even within a group.
In the above sample data, rows 0 and 1, as well as rows 7, 8 and 9, fulfill these criteria. I'd want to remove [0, 1], and either [7, 8] or [8, 9].
Another constraint here is that there could be entirely duplicate rows ([10, 11]). In this case, I want to keep both rows.
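For illustration, a plain drop_duplicates over all three columns gets this wrong: it drops rows 9 and 11 (both of which appear in my desired output below) while leaving the zero-sum pair at rows 0 and 1 untouched:

# naive de-duplication: drops rows 9 and 11, but keeps rows 0 and 1
df.drop_duplicates(subset=['ID', 'Val1', 'Val2'])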
The desired output is thus:
ID Val1 Val2
2 A001 2 50
3 A002 5 -40
4 A002 6 40
5 A003 8 60
6 A003 8 -50
9 A004 3 10
10 A005 7 15
11 A005 7 15
Short of iterating over each row and looking for other rows which fit the criteria, I'm out of ideas for a more "pythonic" way to do this. Any help is much appreciated.
Upvotes: 8
Views: 1022
Reputation: 2495
Use groupby and cumsum to find which index of Val2 sums to zero:
import numpy as np

# running total of Val2 within each (ID, Val1) group;
# it hits exactly zero at the second row of a cancelling pair
s = df.groupby(['ID', 'Val1']).Val2.cumsum() == 0
# positional indices of those zero-sum rows
n = np.where(s)[0]
# each flagged row cancels against the row just before it
to_remove = np.concatenate((n, n - 1))
new_df = df[~df.index.isin(to_remove)]
new_df
ID Val1 Val2
2 A001 2 50
3 A002 5 -40
4 A002 6 40
5 A003 8 60
6 A003 8 -50
9 A004 3 10
10 A005 7 15
11 A005 7 15
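One caveat: this relies on the running total hitting exactly zero at the second row of a cancelling pair. If another value sits between the two rows, the cumulative sum never lands on zero and the pair is missed. A quick sketch with hypothetical values (not from the question's data):

# 100 and -100 cancel, but are separated by 50
pd.Series([100, 50, -100]).cumsum()
# 0    100
# 1    150
# 2     50   <- never zero, so the pair goes undetected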
Upvotes: 1
Reputation: 28709
I put some comments in the code, so hopefully my line of thought is clear:
import numpy as np

cond = df.assign(temp=df.Val2.abs())
# a way to get the same values (differentiated by their sign)
# to follow each other
cond = cond.sort_values(["ID", "Val1", "temp"])
# cumsum yields a zero for numbers that differ only by their sign
cond["check"] = cond.groupby(["ID", "Val1", "temp"]).Val2.cumsum()
cond["check"] = np.where(cond.check != 0, np.nan, cond.check)
# the backward fill (at most one row) assigns the zero marker
# to both members of a pair that summed to zero
cond["check"] = cond["check"].bfill(limit=1)
# keep a row unless it is duplicated on ID/Val1 *and* is part
# of a pair whose Val2 values summed to zero
cond.loc[
    ~(cond.duplicated(["ID", "Val1"], keep=False) & (cond.check == 0)),
    ["ID", "Val1", "Val2"],
]
ID Val1 Val2
2 A001 2 50
3 A002 5 -40
4 A002 6 40
6 A003 8 -50
5 A003 8 60
9 A004 3 10
10 A005 7 15
11 A005 7 15
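Since the frame was re-sorted along the way (notice rows 6 and 5 above), an optional sort_index() call restores the original row order:

# optional: restore the original row order
cond.loc[
    ~(cond.duplicated(["ID", "Val1"], keep=False) & (cond.check == 0)),
    ["ID", "Val1", "Val2"],
].sort_index()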
Upvotes: 2
Reputation: 11512
I believe there might be a less "brute force" method than this one, but it has the merit of being transparent.
import pandas as pd
df = pd.DataFrame({'ID':  ['A001', 'A001', 'A001', 'A002', 'A002', 'A003', 'A003', 'A004', 'A004', 'A004'],
                   'Val1': [2, 2, 2, 5, 6, 8, 8, 3, 3, 3],
                   'Val2': [100, -100, 50, 40, 45, 60, -50, 10, -10, 10]})

# absolute value, so +100 and -100 look identical
df['Val3'] = df['Val2'].abs()
# drop rows that are complete duplicates
df2 = df.drop_duplicates()
# without Val2, rows that differ only in sign become duplicates
df2 = df2.drop(['Val2'], axis=1)
df3 = df2.drop_duplicates()
# recover Val2 by merging the surviving rows back on the index
result = pd.merge(df3, df, left_index=True, right_index=True, how='inner')
result = result.drop(['ID_x', 'Val1_x', 'Val3_x', 'Val3_y'], axis=1)
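The merge leaves _y suffixes on the surviving key columns; if you want the original names back, an optional rename finishes the job:

# restore the original column names after the merge
result = result.rename(columns={'ID_y': 'ID', 'Val1_y': 'Val1'})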
Upvotes: 0
Reputation: 320
Use the drop_duplicates method to remove duplicates. Here is a sample of code:
>>> import pandas as pd
>>> df = pd.DataFrame({'ID':['A001', 'A001', 'A001', 'A002', 'A002', 'A003', 'A003', 'A004', 'A004', 'A004'],
... 'Val1':[2, 2, 2, 5, 6, 8, 8, 3, 3, 3],
... 'Val2':[100, -100, 50, 40, 45, 60, -50, 10, -10, 10]})
>>>
>>> df.drop_duplicates(subset="Val2", keep= "last", inplace = True)
>>> df
ID Val1 Val2
0 A001 2 100
1 A001 2 -100
2 A001 2 50
3 A002 5 40
4 A002 6 45
5 A003 8 60
6 A003 8 -50
8 A004 3 -10
9 A004 3 10
>>> df.drop_duplicates(subset="Val1", keep= "last", inplace = True)
>>> df
ID Val1 Val2
2 A001 2 50
3 A002 5 40
4 A002 6 45
6 A003 8 -50
9 A004 3 10
Could you explain more clearly what this means: "and where Val2 sums to zero across two rows"? That would help in getting to the full solution.
Upvotes: 0
Reputation: 2534
What about:
import numpy as np

# rolling sum of two consecutive Val2 values within each (ID, Val1) group
temp = df.groupby(['ID', 'Val1'])[['Val2']].rolling(2).sum()
# positions where two consecutive values cancel out
ix = temp[temp.Val2 == 0].index
# the last level of the MultiIndex holds the original row label
ar = np.array([x[-1] for x in ix.values])
# drop both members of each cancelling pair
ix2 = sorted(set(ar.tolist() + (ar - 1).tolist()))
df.drop(ix2, inplace=True)
df.drop_duplicates(['ID', 'Val1'], keep='first', inplace=True)
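Tracing the intermediate values by hand for the question's sample data:

# ix  -> MultiIndex [('A001', 2, 1), ('A004', 3, 8), ('A004', 3, 9)]
# ar  -> [1, 8, 9]
# ix2 -> [0, 1, 7, 8, 9], so rows 0, 1, 7, 8 and 9 are all dropped
# the final drop_duplicates then leaves rows 2, 3, 4, 5 and 10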
But this answer follows your "textual" description rather than the desired output you posted: rows 8 & 9's Val2 do sum to zero as well (so rows 7-9 all go), and de-duplicating on ID/Val1 in the last line also collapses rows 5/6 and 10/11...
Upvotes: 0