Alex_Y

Reputation: 608

Pandas drop ending tail zeros in column, grouping by another column

I've created the following function to drop trailing zeros in column "value" for each "proj". But it seems, so to speak, "not optimal") Hope someone could give advice on how to do it right.

Input table:

+---+------+---------+--------------+-------+
|   | proj | weeknum | weekduration | value |
+---+------+---------+--------------+-------+
| 0 |    1 |       1 |            4 |     0 |
| 1 |    1 |       2 |            4 |    11 |
| 2 |    1 |       3 |            4 |     0 |
| 3 |    1 |       4 |            4 |     0 |
| 4 |    2 |       1 |            4 |    10 |
| 5 |    2 |       2 |            4 |    11 |
| 6 |    2 |       3 |            4 |    12 |
| 7 |    2 |       4 |            4 |     0 |
+---+------+---------+--------------+-------+
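For anyone who wants to reproduce this, the table above corresponds to the following DataFrame (column values copied straight from the table):

import pandas as pd

df = pd.DataFrame({
    'proj':         [1, 1, 1, 1, 2, 2, 2, 2],
    'weeknum':      [1, 2, 3, 4, 1, 2, 3, 4],
    'weekduration': [4, 4, 4, 4, 4, 4, 4, 4],
    'value':        [0, 11, 0, 0, 10, 11, 12, 0],
})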

For each "proj" we have to select and drop weeknum with "value" == zero, but only at the end (sort by "weeknum").

For this example the indexes to drop are 2, 3 and 7, leaving:

+---+------+---------+--------------+-------+
|   | proj | weeknum | weekduration | value |
+---+------+---------+--------------+-------+
| 0 |    1 |       1 |            4 |     0 |
| 1 |    1 |       2 |            4 |    11 |
| 4 |    2 |       1 |            4 |    10 |
| 5 |    2 |       2 |            4 |    11 |
| 6 |    2 |       3 |            4 |    12 |
+---+------+---------+--------------+-------+

My simple draft code:

abnrm_list = []  # indexes of trailing zero-value rows
for i in df.proj.unique():
    d = df[df['proj'] == i]
    last_week = d['weeknum'].max()
    first_week = d['weeknum'].min()
    # walk backwards from the last week while "value" is zero
    while last_week >= first_week:
        row = d[d['weeknum'] == last_week]
        if row['value'].values[0] != 0:
            break
        abnrm_list.append(row.index[0])
        last_week -= 1

But I'm sure there must be a simpler, clearer solution to this.

Also, I need to update "weekduration" to the new value after dropping the zero rows.

Upvotes: 0

Views: 272

Answers (2)

Valdi_Bo

Reputation: 31011

To drop trailing zeroes run:

result = df.groupby('proj').apply(
    lambda grp: grp[grp.value[::-1].cumsum().gt(0)[::-1]])\
    .reset_index(level=0, drop=True)

The lambda function:

  • takes value column in reversed order,
  • computes cumsum,
  • converts to bool (is cumsum > 0),
  • reverts back to the "original" order,
  • uses boolean indexing to drop trailing zeroes.

The above procedure is repeated for all proj groups.

All that remains to be done (at this step) is reset_index to drop the top level of the index (grouping keys).
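To see these steps concretely, here is a trace on the proj == 1 group (values 0, 11, 0, 0); the comments show what each step prints:

grp = df[df.proj == 1]
print (grp.value[::-1].tolist())                       # [0, 0, 11, 0]
print (grp.value[::-1].cumsum().tolist())              # [0, 0, 11, 11]
print (grp.value[::-1].cumsum().gt(0).tolist())        # [False, False, True, True]
print (grp.value[::-1].cumsum().gt(0)[::-1].tolist())  # [True, True, False, False]

So the mask keeps indexes 0 and 1 and drops the trailing zero rows 2 and 3.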

And now the second part - update weekduration:

result.weekduration = result.groupby('proj').weekduration\
    .transform(lambda grp: grp.size)

The final result is:

   proj  weeknum  weekduration  value
0     1        1             2      0
1     1        2             2     11
4     2        1             3     10
5     2        2             3     11
6     2        3             3     12
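As a side note (not part of the original answer), the weekduration update can also be written with the built-in 'size' aggregation name, which avoids the Python-level lambda:

result.weekduration = result.groupby('proj').weekduration.transform('size')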

Edit

Or if you prefer an "all in one" solution, then define the following function to filter a group and at the same time set weekduration:

def myFilter(grp):
    # keep rows up to the last non-zero "value"
    grp = grp[grp.value[::-1].cumsum().gt(0)[::-1]]
    # the new week duration is the remaining group size
    grp.weekduration = grp.index.size
    return grp

Then compute the result, applying this function to each group (grouped by proj):

result = df.groupby('proj').apply(myFilter).reset_index(level=0, drop=True)
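One caveat: myFilter assigns weekduration to a filtered slice of the group, which can trigger pandas' SettingWithCopyWarning on some versions. A sketch of the same function with an explicit copy avoids that:

def myFilter(grp):
    # work on an explicit copy of the filtered group before assigning
    grp = grp[grp.value[::-1].cumsum().gt(0)[::-1]].copy()
    grp.weekduration = grp.index.size
    return grp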

Upvotes: 1

jezrael

Reputation: 863166

Use boolean indexing to filter out the trailing 0 values. The mask is a helper Series built by reversing the column order with iloc[::-1], computing a per-group running total with GroupBy.cumsum, reversing back, and comparing to 0 with Series.ne:

mask = df['value'].iloc[::-1].groupby(df['proj']).cumsum().iloc[::-1].ne(0)
df = df[mask].copy()
print (df)
   proj  weeknum  weekduration  value
0     1        1             4      0
1     1        2             4     11
4     2        1             4     10
5     2        2             4     11
6     2        3             4     12

And then update "weekduration" with Series.map and Series.value_counts:

df['weekduration'] = df['proj'].map(df['proj'].value_counts())
print (df)
   proj  weeknum  weekduration  value
0     1        1             2      0
1     1        2             2     11
4     2        1             3     10
5     2        2             3     11
6     2        3             3     12
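For reference, the lookup Series that map consumes here is the value counts of the filtered proj column (output shown for an older pandas; newer versions label the result "count"):

print (df['proj'].value_counts())
2    3
1    2
Name: proj, dtype: int64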

Details:

print (df['value'].iloc[::-1])
7     0
6    12
5    11
4    10
3     0
2     0
1    11
0     0
Name: value, dtype: int64

print (df['value'].iloc[::-1].groupby(df['proj']).cumsum())
7     0
6    12
5    23
4    33
3     0
2     0
1    11
0    11
Name: value, dtype: int64
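Reversing back with iloc[::-1] and testing ne(0) then gives the final mask (shown here to complete the walkthrough):

print (df['value'].iloc[::-1].groupby(df['proj']).cumsum().iloc[::-1].ne(0))
0     True
1     True
2    False
3    False
4     True
5     True
6     True
7    False
Name: value, dtype: bool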

Performance:

You need to avoid filtering per group, because it is slow:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000

df = pd.DataFrame({'proj':np.random.randint(10000, size=N), 
                   'value': np.random.choice([0,1,2], N)}).sort_values('proj')
print (df)

In [76]: %timeit df.groupby('proj').apply(lambda grp: grp[grp.value[::-1].cumsum().gt(0)[::-1]]).reset_index(level=0, drop=True)
20.2 s ± 4.04 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [78]: %timeit df[df['value'].iloc[::-1].groupby(df['proj']).cumsum().iloc[::-1].ne(0)]
268 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Upvotes: 2
