Alex_Y

Reputation: 608

Pandas drop ending tail zeros in column, grouping by another column

I've created the following function to drop trailing zeros in column "value" for each "proj". But it seems, so to speak, "not optimal") Hope someone could give advice on how to do it right.

Input table:

+---+------+---------+--------------+-------+
|   | proj | weeknum | weekduration | value |
+---+------+---------+--------------+-------+
| 0 |    1 |       1 |            4 |     0 |
| 1 |    1 |       2 |            4 |    11 |
| 2 |    1 |       3 |            4 |     0 |
| 3 |    1 |       4 |            4 |     0 |
| 4 |    2 |       1 |            4 |    10 |
| 5 |    2 |       2 |            4 |    11 |
| 6 |    2 |       3 |            4 |    12 |
| 7 |    2 |       4 |            4 |     0 |
+---+------+---------+--------------+-------+
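For anyone who wants to reproduce this, the table above corresponds to the following DataFrame (column values copied straight from the table):

import pandas as pd

df = pd.DataFrame({
    'proj':         [1, 1, 1, 1, 2, 2, 2, 2],
    'weeknum':      [1, 2, 3, 4, 1, 2, 3, 4],
    'weekduration': [4, 4, 4, 4, 4, 4, 4, 4],
    'value':        [0, 11, 0, 0, 10, 11, 12, 0],
})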

For each "proj" we have to select and drop weeknum with "value" == zero, but only at the end (sort by "weeknum").

For this example the indexes to drop are 2, 3 and 7, leaving:

+---+------+---------+--------------+-------+
|   | proj | weeknum | weekduration | value |
+---+------+---------+--------------+-------+
| 0 |    1 |       1 |            4 |     0 |
| 1 |    1 |       2 |            4 |    11 |
| 4 |    2 |       1 |            4 |    10 |
| 5 |    2 |       2 |            4 |    11 |
| 6 |    2 |       3 |            4 |    12 |
+---+------+---------+--------------+-------+

My simple draft code:

abnrm_list = []  # indexes of trailing zero-value rows
for i in df.proj.unique():
    d = df[df['proj'] == i]
    last_week = d['weeknum'].max()
    first_week = d['weeknum'].min()
    # walk backwards from the last week while "value" is zero
    while last_week >= first_week:
        row = d[d['weeknum'] == last_week]
        if row['value'].values[0] != 0:
            break
        abnrm_list.append(row.index[0])
        last_week -= 1

But I'm sure there must be a simpler, clearer solution to this.

Also, I need to update "weekduration" to the new value after dropping the zero rows.

Upvotes: 0

Views: 272

Answers (2)

Valdi_Bo

Reputation: 31011

To drop trailing zeroes run:

result = df.groupby('proj').apply(
    lambda grp: grp[grp.value[::-1].cumsum().gt(0)[::-1]])\
    .reset_index(level=0, drop=True)

The lambda function:

  • takes value column in reversed order,
  • computes cumsum,
  • converts to bool (is cumsum > 0),
  • reverts back to the "original" order,
  • uses boolean indexing to drop trailing zeroes.

The above procedure is repeated for all proj groups.

All that remains to be done (at this step) is reset_index to drop the top level of the index (grouping keys).
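To see these steps concretely, here is a trace on the proj == 1 group (values 0, 11, 0, 0); the comments show what each step prints:

grp = df[df.proj == 1]
print (grp.value[::-1].tolist())                       # [0, 0, 11, 0]
print (grp.value[::-1].cumsum().tolist())              # [0, 0, 11, 11]
print (grp.value[::-1].cumsum().gt(0).tolist())        # [False, False, True, True]
print (grp.value[::-1].cumsum().gt(0)[::-1].tolist())  # [True, True, False, False]

So the mask keeps indexes 0 and 1 and drops the trailing zero rows 2 and 3.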

And now the second part - update weekduration:

result.weekduration = result.groupby('proj').weekduration\
    .transform(lambda grp: grp.size)

The final result is:

   proj  weeknum  weekduration  value
0     1        1             2      0
1     1        2             2     11
4     2        1             3     10
5     2        2             3     11
6     2        3             3     12
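As a side note (not part of the original answer), the weekduration update can also be written with the built-in 'size' aggregation name, which avoids the Python-level lambda:

result.weekduration = result.groupby('proj').weekduration.transform('size')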

Edit

Or if you prefer an "all in one" solution, then define the following function to filter a group and at the same time set weekduration:

def myFilter(grp):
    # keep rows up to the last non-zero "value"
    grp = grp[grp.value[::-1].cumsum().gt(0)[::-1]]
    # the new week duration is the remaining group size
    grp.weekduration = grp.index.size
    return grp

Then compute the result, applying this function to each group (grouped by proj):

result = df.groupby('proj').apply(myFilter).reset_index(level=0, drop=True)
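One caveat: myFilter assigns weekduration to a filtered slice of the group, which can trigger pandas' SettingWithCopyWarning on some versions. A sketch of the same function with an explicit copy avoids that:

def myFilter(grp):
    # work on an explicit copy of the filtered group before assigning
    grp = grp[grp.value[::-1].cumsum().gt(0)[::-1]].copy()
    grp.weekduration = grp.index.size
    return grp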

Upvotes: 1

jezrael

Reputation: 863166

Use boolean indexing to filter out the trailing 0 values. The mask is a helper Series built by reversing the column order with iloc[::-1], computing a per-group running total with GroupBy.cumsum, reversing back, and comparing to 0 with Series.ne:

mask = df['value'].iloc[::-1].groupby(df['proj']).cumsum().iloc[::-1].ne(0)
df = df[mask].copy()
print (df)
   proj  weeknum  weekduration  value
0     1        1             4      0
1     1        2             4     11
4     2        1             4     10
5     2        2             4     11
6     2        3             4     12

And then update "weekduration" with Series.map and Series.value_counts:

df['weekduration'] = df['proj'].map(df['proj'].value_counts())
print (df)
   proj  weeknum  weekduration  value
0     1        1             2      0
1     1        2             2     11
4     2        1             3     10
5     2        2             3     11
6     2        3             3     12
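For reference, the lookup Series that map consumes here is the value counts of the filtered proj column (output shown for an older pandas; newer versions label the result "count"):

print (df['proj'].value_counts())
2    3
1    2
Name: proj, dtype: int64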

Details:

print (df['value'].iloc[::-1])
7     0
6    12
5    11
4    10
3     0
2     0
1    11
0     0
Name: value, dtype: int64

print (df['value'].iloc[::-1].groupby(df['proj']).cumsum())
7     0
6    12
5    23
4    33
3     0
2     0
1    11
0    11
Name: value, dtype: int64
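Reversing back with iloc[::-1] and testing ne(0) then gives the final mask (shown here to complete the walkthrough):

print (df['value'].iloc[::-1].groupby(df['proj']).cumsum().iloc[::-1].ne(0))
0     True
1     True
2    False
3    False
4     True
5     True
6     True
7    False
Name: value, dtype: bool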

Performance:

You need to avoid filtering per group, because it is slow:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000

df = pd.DataFrame({'proj':np.random.randint(10000, size=N), 
                   'value': np.random.choice([0,1,2], N)}).sort_values('proj')
print (df)

In [76]: %timeit df.groupby('proj').apply(lambda grp: grp[grp.value[::-1].cumsum().gt(0)[::-1]]).reset_index(level=0, drop=True)
20.2 s ± 4.04 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [78]: %timeit df[df['value'].iloc[::-1].groupby(df['proj']).cumsum().iloc[::-1].ne(0)]
268 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Upvotes: 2
