Reputation: 83
I have the following DataFrame:
import pandas as pd
data = {'id': ['A','A','A','A','A','A',
'A','A','A','A','A','A',
'B','B','B','B','B','B',
'C', 'C', 'C', 'C', 'C', 'C',
'D', 'D', 'D', 'D', 'D', 'D'],
'city':['London', 'London','London', 'London', 'London', 'London',
'New York', 'New York', 'New York', 'New York', 'New York', 'New York',
'Milan', 'Milan','Milan', 'Milan','Milan', 'Milan',
'Paris', 'Paris', 'Paris', 'Paris', 'Paris', 'Paris',
'Berlin', 'Berlin','Berlin', 'Berlin','Berlin', 'Berlin'],
'year': [2000,2001, 2002, 2003, 2004, 2005,
2000,2001, 2002, 2003, 2004, 2005,
2000,2001, 2002, 2003, 2004, 2005,
2000,2001, 2002, 2003, 2004, 2005,
2000,2001, 2002, 2003, 2004, 2005],
't': [0,0,0,0,1,0,
0,0,0,0,0,1,
0,0,0,0,0,0,
0,0,1,0,0,0,
0,0,0,0,1,0]}
df = pd.DataFrame(data)
For each id - city group, I need to drop the rows for the years after the one where t=1. For instance, id = A is in London with t=1 in year=2004, so I want to drop the rows of the group A - London where year=2005. Please note that if t is never 1 for an id in a given city over 2000-2005, I want to keep all the rows (see, for instance, id = B in Milan).
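Just to make the rule concrete (this is only an inspection aid, not the solution I'm after), the first year with t=1 in each id - city group can be listed like this; groups that never hit t=1, such as B - Milan, simply do not appear:
# first year with t=1 per id-city group (groups without any t=1 are absent)
cutoff = df[df['t'].eq(1)].groupby(['id', 'city'])['year'].min()
print(cutoff)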
The desired output:
import pandas as pd
data = {'id': ['A','A','A','A','A',
'A','A','A','A','A','A',
'B','B','B','B','B','B',
'C', 'C', 'C',
'D', 'D', 'D', 'D', 'D'],
'city':['London', 'London','London', 'London', 'London',
'New York', 'New York', 'New York', 'New York', 'New York', 'New York',
'Milan', 'Milan','Milan', 'Milan','Milan', 'Milan',
'Paris', 'Paris', 'Paris',
'Berlin', 'Berlin','Berlin', 'Berlin','Berlin'],
'year': [2000,2001, 2002, 2003, 2004,
2000,2001, 2002, 2003, 2004, 2005,
2000,2001, 2002, 2003, 2004, 2005,
2000,2001, 2002,
2000,2001, 2002, 2003, 2004],
't': [0,0,0,0,1,
0,0,0,0,0,1,
0,0,0,0,0,0,
0,0,1,
0,0,0,0,1]}
df = pd.DataFrame(data)
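(To check a candidate solution against this expected frame, pandas.testing.assert_frame_equal can compare the two once the row labels are aligned; a minimal sketch, assuming the expected frame above is stored as expected rather than df and the filtered original is result:)
import pandas.testing as tm
# reset_index(drop=True) because filtering keeps the original, gappy index
tm.assert_frame_equal(result.reset_index(drop=True),
                      expected.reset_index(drop=True))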
Upvotes: 0
Views: 40
Reputation: 862581
The idea is to use a cumulative sum per group, but the values need to be shifted first so the t=1 row itself is kept; then remove every row after the first 1 with boolean indexing:
#if not sorted years per groups
#df = df.sort_values(['id','city','year'])
df = df[~df.groupby(['id', 'city'])['t'].transform(lambda x: x.shift().cumsum()).gt(0)]
print (df)
id city year t
0 A London 2000 0
1 A London 2001 0
2 A London 2002 0
3 A London 2003 0
4 A London 2004 1
6 A New York 2000 0
7 A New York 2001 0
8 A New York 2002 0
9 A New York 2003 0
10 A New York 2004 0
11 A New York 2005 1
12 B Milan 2000 0
13 B Milan 2001 0
14 B Milan 2002 0
15 B Milan 2003 0
16 B Milan 2004 0
17 B Milan 2005 0
18 C Paris 2000 0
19 C Paris 2001 0
20 C Paris 2002 1
24 D Berlin 2000 0
25 D Berlin 2001 0
26 D Berlin 2002 0
27 D Berlin 2003 0
28 D Berlin 2004 1
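For readability, the same logic can be split into steps; this is just an equivalent sketch, where shift(fill_value=0) (available in reasonably recent pandas) avoids the NaN a plain shift leaves in the first row of each group:
# shift t within each group so the row where t first becomes 1 is still kept
shifted = df.groupby(['id', 'city'])['t'].shift(fill_value=0)
# running sum of the shifted flags per group: 0 through the t=1 row, positive for every later year
after_first_one = shifted.groupby([df['id'], df['city']]).cumsum().gt(0)
out = df[~after_first_one]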
Upvotes: 1