Pineapple P

Reputation: 83

Drop redundant rows for group in pandas

I have the following DataFrame:

import pandas as pd

data = {'id':  ['A','A','A','A','A','A',
                'A','A','A','A','A','A',
                'B','B','B','B','B','B',
               'C', 'C', 'C', 'C', 'C', 'C',
               'D', 'D', 'D', 'D', 'D', 'D'],
        'city':['London', 'London','London', 'London', 'London', 'London',
                'New York', 'New York', 'New York', 'New York', 'New York', 'New York',
                'Milan', 'Milan','Milan', 'Milan','Milan', 'Milan',
               'Paris', 'Paris', 'Paris', 'Paris', 'Paris', 'Paris',
               'Berlin', 'Berlin','Berlin', 'Berlin','Berlin', 'Berlin'],
        'year': [2000,2001, 2002, 2003, 2004, 2005,
                2000,2001, 2002, 2003, 2004, 2005,
                 2000,2001, 2002, 2003, 2004, 2005,
                2000,2001, 2002, 2003, 2004, 2005,
                2000,2001, 2002, 2003, 2004, 2005],
        't': [0,0,0,0,1,0,
              0,0,0,0,0,1,
              0,0,0,0,0,0,
              0,0,1,0,0,0,
             0,0,0,0,1,0]}

df = pd.DataFrame(data)

For each id - city group, I need to drop the rows for all years after the year in which t=1. For instance, id = A is in London and has t=1 in year=2004, so I want to drop the row for the group A - London where year=2005. Please note that if t never equals 1 for an id - city group over 2000-2005, I want to keep all of its rows (see, for instance, id = B in Milan).

The desired output:

import pandas as pd

data = {'id':  ['A','A','A','A','A',
                'A','A','A','A','A','A',
                'B','B','B','B','B','B',
               'C', 'C', 'C',
               'D', 'D', 'D', 'D', 'D'],
        'city':['London', 'London','London', 'London', 'London',
                'New York', 'New York', 'New York', 'New York', 'New York', 'New York',
                'Milan', 'Milan','Milan', 'Milan','Milan', 'Milan',
               'Paris', 'Paris', 'Paris',
               'Berlin', 'Berlin','Berlin', 'Berlin','Berlin'],
        'year': [2000,2001, 2002, 2003, 2004,
                2000,2001, 2002, 2003, 2004, 2005,
                 2000,2001, 2002, 2003, 2004, 2005,
                2000,2001, 2002,
                2000,2001, 2002, 2003, 2004],
        't': [0,0,0,0,1,
              0,0,0,0,0,1,
              0,0,0,0,0,0,
              0,0,1,
             0,0,0,0,1]}

df = pd.DataFrame(data)

Upvotes: 0

Views: 40

Answers (1)

jezrael

Reputation: 862581

The idea is to use a cumulative sum per group, but the values need to be shifted first; then all rows after the first 1 are removed with boolean indexing:

# if the years are not already sorted within each group, sort first
# df = df.sort_values(['id', 'city', 'year'])

df = df[~df.groupby(['id', 'city'])['t'].transform(lambda x: x.shift().cumsum()).gt(0)]

print(df)
   id      city  year  t
0   A    London  2000  0
1   A    London  2001  0
2   A    London  2002  0
3   A    London  2003  0
4   A    London  2004  1
6   A  New York  2000  0
7   A  New York  2001  0
8   A  New York  2002  0
9   A  New York  2003  0
10  A  New York  2004  0
11  A  New York  2005  1
12  B     Milan  2000  0
13  B     Milan  2001  0
14  B     Milan  2002  0
15  B     Milan  2003  0
16  B     Milan  2004  0
17  B     Milan  2005  0
18  C     Paris  2000  0
19  C     Paris  2001  0
20  C     Paris  2002  1
24  D    Berlin  2000  0
25  D    Berlin  2001  0
26  D    Berlin  2002  0
27  D    Berlin  2003  0
28  D    Berlin  2004  1
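
To see why the shift() is needed (it is what keeps the t=1 row itself; a plain cumsum would drop that row too), here is a minimal sketch of the intermediate values, assuming the `data` dict from the question is still in scope:

import pandas as pd

# rebuild the original frame (the line above overwrote df with the filtered result)
df = pd.DataFrame(data)

# running total of *earlier* t values within each id/city group:
# NaN on the first row of each group, and > 0 only on rows after the first t=1
running = df.groupby(['id', 'city'])['t'].transform(lambda x: x.shift().cumsum())

# rows where drop=True come after the first t=1; the t=1 row itself stays
# because its running total is still 0
print(df.assign(running=running, drop=running.gt(0))
        .query("id == 'A' and city == 'London'"))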

Upvotes: 1
