Reputation: 1319

Drop Rows of an id after a particular column value in Pandas

I have a dataset like:

Id   Status

1     0
1     0
1     0
1     0
1     1
2     0
1     0
2     0
3     0
3     0

I want to drop all rows of an id after its status became 1, i.e. my new dataset will be:

Id   Status

1     0
1     0
1     0
1     0
1     1
2     0
2     0
3     0
3     0

i.e.

1     0   --> gets removed since this row appears after id 1 already had a status of 1

How can I implement this efficiently? I have a very large (200 GB+) dataset.

Thanks for your help.

Upvotes: 7

Answers (3)

gmds

Reputation: 19885

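EDIT: Revisiting this question a month later, there is actually a much simpler way with groupby and cumsum: group by Id and take the cumsum of Status, subtract each row's own Status so that only earlier 1s are counted, then keep the rows where that count is 0. The subtraction is what keeps the row holding the first 1 itself:

df[df.groupby('Id')['Status'].cumsum() - df['Status'] < 1]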
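To make the cumsum idea concrete, here is a minimal sketch on the question's sample data. The variable names (`seen`, `result`) are only illustrative. It subtracts the current Status from the running sum so that the row holding the first 1 is kept, as the expected output in the question requires.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Id":     [1, 1, 1, 1, 1, 2, 1, 2, 3, 3],
    "Status": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# Running count of 1s per Id, including the current row.
seen = df.groupby("Id")["Status"].cumsum()

# Subtracting the current Status leaves the count of 1s strictly before
# each row, so the first 1 is kept and every later row of that Id is dropped.
result = df[(seen - df["Status"]) < 1]
print(result)
```

Because the mask is built from vectorised operations rather than a Python-level `apply`, it should scale better with the number of groups, although that is worth measuring on real data.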

Original answer: the best way I have found is to find the index of the first 1 and slice each group up to and including it. In cases where no 1 exists, return the group unchanged:

def remove(group):
    # Reset to positional labels so the first match's label can be used with iloc.
    indexless = group.reset_index(drop=True)
    ones = indexless[indexless['Status'] == 1]
    if len(ones) > 0:
        # Keep everything up to and including the first 1.
        return indexless.iloc[:ones.index[0] + 1]
    else:
        return indexless

df.groupby('Id').apply(remove).reset_index(drop=True)

Output:

   Id  Status
0   1       0
1   1       0
2   1       0
3   1       0
4   1       1
5   2       0
6   2       0
7   3       0
8   3       0
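Since the question mentions a dataset too large for memory, the same rule can be sketched chunk by chunk. This is only an illustration under assumptions not stated in the thread: the data is a CSV with `Id` and `Status` columns, its row order is the order that defines "after", and the set of Ids that have reached 1 fits in memory. The helper name `filter_chunks` is hypothetical.

```python
import io
import pandas as pd

def filter_chunks(chunks):
    """Yield filtered chunks, remembering which Ids already reached Status 1."""
    closed = set()  # Ids whose first 1 appeared in an earlier chunk
    for chunk in chunks:
        # Drop rows of Ids that were closed by an earlier chunk.
        chunk = chunk[~chunk["Id"].isin(closed)]
        # Within the chunk, count the 1s strictly before each row of the same Id.
        prior_ones = chunk.groupby("Id")["Status"].cumsum() - chunk["Status"]
        kept = chunk[prior_ones < 1]
        # Ids that reached 1 in this chunk are closed for all later chunks.
        closed.update(kept.loc[kept["Status"] == 1, "Id"])
        yield kept

# Stand-in for a large file: the question's data, read three rows at a time.
csv_text = "Id,Status\n1,0\n1,0\n1,0\n1,0\n1,1\n2,0\n1,0\n2,0\n3,0\n3,0\n"
reader = pd.read_csv(io.StringIO(csv_text), chunksize=3)
result = pd.concat(list(filter_chunks(reader)))
print(result)
```

For a real file, each filtered chunk would more likely be appended to an output file than concatenated in memory.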

Upvotes: 2

ResidentSleeper

Reputation: 2495

Use groupby with cumsum to select, for each Id, the rows from its first Status of 1 onward.

res = df.groupby('Id', group_keys=False).apply(lambda x: x[x.Status.cumsum() > 0])
res

   Id  Status
4   1       1
6   1       0

Then exclude the rows of that result whose Status is 0, since those are the rows that come after the first 1.

not_select_id = res[res.Status==0].index

df[~df.index.isin(not_select_id)]

   Id  Status
0   1       0
1   1       0
2   1       0
3   1       0
4   1       1
5   2       0
7   2       0
8   3       0
9   3       0

Upvotes: 1

Toby Petty

Reputation: 4670

Here's an idea:

You can create a dict with the first index where the status is 1 for each ID (assuming the DataFrame is sorted by ID):

d = df.loc[df["Status"]==1].drop_duplicates()
d = dict(zip(d["Id"], d.index))

Then you create a column holding, for each row, the index of the first status=1 for its Id:

df["first"] = df["Id"].map(d)

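Finally, keep every row whose index is at most the value in the first column, so the row holding the first 1 is retained. Ids that never reach 1 map to NaN and have to be kept explicitly:

df = df.loc[df["first"].isna() | (df.index <= df["first"])]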
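The three steps above can be put together on the question's sample data. This is a sketch under the assumption that the DataFrame has a default integer index in row order. The names `ones`, `first_one` and `first` are illustrative, and the `isna()` check is there so that Ids without any 1 are not dropped.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Id":     [1, 1, 1, 1, 1, 2, 1, 2, 3, 3],
    "Status": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# Index label of the first row with Status 1 for each Id.
ones = df.loc[df["Status"] == 1].drop_duplicates(subset="Id")
first_one = dict(zip(ones["Id"], ones.index))

# Map that label onto every row; Ids that never reach 1 get NaN.
first = df["Id"].map(first_one)

# Keep rows up to and including the first 1, plus all rows of Ids with no 1.
result = df.loc[first.isna() | (first >= df.index)]
print(result)
```

Keeping `first` as a separate Series rather than a new column avoids having to drop a helper column afterwards.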

Upvotes: 2
