Reputation: 1319
I have a dataset like:
Id Status
1 0
1 0
1 0
1 0
1 1
2 0
1 0 # --> gets removed since this row appears after id 1 already had a status of 1
2 0
3 0
3 0
I want to drop all rows of an id after its status became 1, i.e. my new dataset will be:
Id Status
1 0
1 0
1 0
1 0
1 1
2 0
2 0
3 0
3 0
I want to learn how to implement this computation efficiently since I have a very large (200 GB+) dataset.
The solution I currently have is to find the index of the first 1 and slice each group that way. In cases where no 1 exists, return the group unchanged:
def remove(series):
    indexless = series.reset_index(drop=True)
    ones = indexless[indexless['Status'] == 1]
    if len(ones) > 0:
        return indexless.iloc[:ones.index[0] + 1]
    else:
        return indexless

df.groupby('Id').apply(remove).reset_index(drop=True)
However, this runs very slowly. Is there any way to fix this, or otherwise speed up the computation?
Upvotes: 2
Views: 547
Reputation: 5535
Let's start with this dataset.
import numpy as np
import pandas as pd

l = [[1,0],[1,0],[1,0],[1,0],[1,1],[2,0],[1,0],[2,0],[2,1],[3,0],[2,0],[3,0]]
df_ = pd.DataFrame(l, columns=['id', 'status'])
We will find the status=1 index for each id.
status_1_indice = df_[df_['status']==1].reset_index()[['index', 'id']].set_index('id')
    index
id
1       4
2       8
Now we join df_ with status_1_indice:
join_table = df_.join(status_1_indice, on='id').reset_index().fillna(np.inf)
Notice the .fillna(np.inf) for ids that don't have status=1. Result:
level_0 id status index
0 0 1 0 4.000000
1 1 1 0 4.000000
2 2 1 0 4.000000
3 3 1 0 4.000000
4 4 1 1 4.000000
5 5 2 0 8.000000
6 6 1 0 4.000000
7 7 2 0 8.000000
8 8 2 1 8.000000
9 9 3 0 inf
10 10 2 0 8.000000
11 11 3 0 inf
The required dataframe can be obtained by:
join_table.query('level_0 <= index')[['id', 'status']]
Together:
status_1_indice = df_[df_['status']==1].reset_index()[['index', 'id']].set_index('id')
join_table = df_.join(status_1_indice, on='id').reset_index().fillna(np.inf)
required_df = join_table.query('level_0 <= index')[['id', 'status']]
id status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 2 1
9 3 0
11 3 0
I can't vouch for the performance, but this is more straightforward than the method in the question.
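Since performance on a very large dataset is the main concern, it may help to time this on a bigger synthetic sample before committing to it. A minimal sketch, assuming made-up sizes and a random 0/1 status column (such data can contain several status=1 rows per id, so the helper frame keeps only the first one per id):
import time

import numpy as np
import pandas as pd

# Synthetic sample purely for timing; the sizes are made up for illustration.
rng = np.random.default_rng(0)
n_rows, n_ids = 1_000_000, 10_000
sample = pd.DataFrame({
    'id': rng.integers(0, n_ids, n_rows),
    'status': (rng.random(n_rows) < 0.001).astype(int),  # sparse status=1 events
})

start = time.perf_counter()
# First status=1 position per id (drop_duplicates keeps the earliest occurrence).
firsts = (sample[sample['status'] == 1].reset_index()
          .drop_duplicates('id')[['index', 'id']].set_index('id'))
join_table = sample.join(firsts, on='id').reset_index().fillna(np.inf)
required = join_table.query('level_0 <= index')[['id', 'status']]
print(f"join + query: {time.perf_counter() - start:.3f}s, {len(required)} rows kept")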
Upvotes: 1
Reputation: 863166
The first idea is to create a cumulative sum per group from a boolean mask; a shift is also necessary to avoid losing the first 1:
#pandas 0.24+
s = (df['Status'] == 1).groupby(df['Id']).apply(lambda x: x.shift(fill_value=0).cumsum())
#pandas below 0.24
#s = (df['Status'] == 1).groupby(df['Id']).apply(lambda x: x.shift().fillna(0).cumsum())
df = df[s == 0]
print (df)
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 3 0
9 3 0
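The same shift-then-cumsum idea can also be written without apply at all: per Id, the number of 1s on earlier rows equals the group cumulative sum minus the current row's Status, and a row is kept while that count is still 0. A minimal sketch of this variant, assuming Status only takes the values 0 and 1 (the question's sample frame is rebuilt here so the snippet runs on its own):
import pandas as pd

# The question's sample data, rebuilt for a self-contained example.
df = pd.DataFrame({'Id':     [1, 1, 1, 1, 1, 2, 1, 2, 3, 3],
                   'Status': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]})

# Number of status=1 rows seen on *earlier* rows of the same Id;
# a row survives while that running count is still 0.
prev_ones = df.groupby('Id')['Status'].cumsum() - df['Status']
print(df[prev_ones.eq(0)])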
Another solution is to use a custom function with Series.idxmax:
def f(x):
    if x['new'].any():
        return x.iloc[:x['new'].idxmax()+1, :]
    else:
        return x

df1 = (df.assign(new=(df['Status'] == 1))
         .groupby(df['Id'], group_keys=False)
         .apply(f).drop('new', axis=1))
print (df1)
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
8 2 0
9 3 0
10 3 0
Or a slightly modified first solution: filter only the groups that contain a 1 and apply the solution only there:
m = df['Status'].eq(1)
ids = df.loc[m, 'Id'].unique()
print (ids)
[1]
m1 = df['Id'].isin(ids)
m2 = (m[m1].groupby(df['Id'])
       .apply(lambda x: x.shift(fill_value=0).cumsum())
       .eq(0))
df = df[m2.reindex(df.index, fill_value=True)]
print (df)
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
8 2 0
9 3 0
10 3 0
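None of the approaches above address the 200 GB+ size directly; they all assume the frame fits in memory. If the file is processed in its original row order, the same rule can be applied in a single streaming pass over chunks by carrying a set of ids that have already hit status=1. A rough sketch, assuming a CSV source; the filename data.csv, the chunk size and collecting everything into one frame at the end are placeholders to adapt:
import pandas as pd

done_ids = set()        # ids whose status already reached 1 in an earlier row
kept_chunks = []

# Read the large file in pieces; chunksize is a tuning knob, not a magic number.
for chunk in pd.read_csv('data.csv', chunksize=1_000_000):
    # Drop rows whose id already hit status=1 in an earlier chunk.
    chunk = chunk[~chunk['Id'].isin(done_ids)]

    # Within this chunk, keep rows up to and including each id's first 1.
    prev_ones = chunk.groupby('Id')['Status'].cumsum() - chunk['Status']
    chunk = chunk[prev_ones.eq(0)]

    # Remember ids that reached 1 so later chunks skip their remaining rows.
    done_ids.update(chunk.loc[chunk['Status'] == 1, 'Id'])

    kept_chunks.append(chunk)

result = pd.concat(kept_chunks, ignore_index=True)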
Upvotes: 1