Reputation: 608
I've created the following function to drop trailing zeros in column "value" for each "proj", but it seems, so to speak, "not optimal" :) I hope someone could advise how to do it right.
Input table:
+---+------+---------+--------------+-------+
| | proj | weeknum | weekduration | value |
+---+------+---------+--------------+-------+
| 0 | 1 | 1 | 4 | 0 |
| 1 | 1 | 2 | 4 | 11 |
| 2 | 1 | 3 | 4 | 0 |
| 3 | 1 | 4 | 4 | 0 |
| 4 | 2 | 1 | 4 | 10 |
| 5 | 2 | 2 | 4 | 11 |
| 6 | 2 | 3 | 4 | 12 |
| 7 | 2 | 4 | 4 | 0 |
+---+------+---------+--------------+-------+
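For reproducibility, here is a minimal snippet that builds this frame (assuming pandas is imported as pd):

import pandas as pd

df = pd.DataFrame({'proj': [1, 1, 1, 1, 2, 2, 2, 2],
                   'weeknum': [1, 2, 3, 4, 1, 2, 3, 4],
                   'weekduration': [4] * 8,
                   'value': [0, 11, 0, 0, 10, 11, 12, 0]})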
For each "proj" we have to select and drop weeknum with "value" == zero, but only at the end (sort by "weeknum").
For this example I should get Indexes: 2, 3, 7
+---+------+---------+--------------+-------+
| | proj | weeknum | weekduration | value |
+---+------+---------+--------------+-------+
| 0 | 1 | 1 | 4 | 0 |
| 1 | 1 | 2 | 4 | 11 |
| 4 | 2 | 1 | 4 | 10 |
| 5 | 2 | 2 | 4 | 11 |
| 6 | 2 | 3 | 4 | 12 |
+---+------+---------+--------------+-------+
My simple draft code:
abnrm_list = []
for i in df.proj.unique():
    d = df[df['proj'] == i]
    last_week = d['weeknum'].max()
    cs_hrs = d[d['weeknum'] == last_week]['value'].values[0]
    while cs_hrs == 0:
        abnrm_list.append(d[d['weeknum'] == last_week].index[0])
        last_week = last_week - 1
        cs_hrs = d[d['weeknum'] == last_week]['value'].values[0]
But I'm sure there must be a simpler, clearer solution to this.
Also, I need to update "weekduration" to the new value after dropping the zero rows.
Upvotes: 0
Views: 272
Reputation: 31011
To drop trailing zeroes run:
result = df.groupby('proj').apply(
    lambda grp: grp[grp.value[::-1].cumsum().gt(0)[::-1]])\
    .reset_index(level=0, drop=True)
The lambda function:
- takes grp.value in reversed order,
- computes the cumulative sum, so the trailing zeroes remain 0,
- checks which elements are greater than 0 and reverses the result back to the original order,
- uses this boolean Series to keep only the rows up to and including the last non-zero value.
The above procedure is repeated for all proj groups.
All that remains to be done (at this step) is reset_index to drop the top level of the index (grouping keys).
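For illustration, here is how the intermediate steps of the lambda look on the proj == 1 group (a minimal sketch, assuming df is the sample frame from the question):

grp = df[df.proj == 1]

print (grp.value[::-1])                       # index 3, 2, 1, 0 -> 0, 0, 11, 0
print (grp.value[::-1].cumsum())              # index 3, 2, 1, 0 -> 0, 0, 11, 11
print (grp.value[::-1].cumsum().gt(0)[::-1])  # index 0..3 -> True, True, False, False
# so only rows 0 and 1 of this group survive the filter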
And now the second part - update weekduration:
result.weekduration = result.groupby('proj').weekduration\
    .transform(lambda grp: grp.size)
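As a side note, the same counts can be obtained with the built-in 'size' aggregation instead of a Python-level lambda (an alternative sketch, not part of the original answer):

result.weekduration = result.groupby('proj').weekduration.transform('size')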
The final result is:
proj weeknum weekduration value
0 1 1 2 0
1 1 2 2 11
4 2 1 3 10
5 2 2 3 11
6 2 3 3 12
Or if you prefer an "all in one" solution, then define the following function to filter a group and at the same time set weekduration:
def myFilter(grp):
    grp = grp[grp.value[::-1].cumsum().gt(0)[::-1]]
    grp.weekduration = grp.index.size
    return grp
Then compute the result, applying this function to each group (grouped by proj):
result = df.groupby('proj').apply(myFilter).reset_index(level=0, drop=True)
Upvotes: 1
Reputation: 863166
Use boolean indexing to filter out the trailing 0 values, compared by Series.ne against a helper Series created by reversing the order of the values and applying GroupBy.cumsum:
mask = df['value'].iloc[::-1].groupby(df['proj']).cumsum().iloc[::-1].ne(0)
df = df[mask].copy()
print (df)
proj weeknum weekduration value
0 1 1 4 0
1 1 2 4 11
4 2 1 4 10
5 2 2 4 11
6 2 3 4 12
And then use Series.map with Series.value_counts:
df['weekduration'] = df['proj'].map(df['proj'].value_counts())
print (df)
proj weeknum weekduration value
0 1 1 2 0
1 1 2 2 11
4 2 1 3 10
5 2 2 3 11
6 2 3 3 12
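If you prefer a groupby-based equivalent for the per-group counts (an alternative sketch, not from the original answer), transform with 'size' gives the same numbers:

df['weekduration'] = df.groupby('proj')['proj'].transform('size')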
Details:
print (df['value'].iloc[::-1])
7 0
6 12
5 11
4 10
3 0
2 0
1 11
0 0
Name: value, dtype: int64
print (df['value'].iloc[::-1].groupby(df['proj']).cumsum())
7 0
6 12
5 23
4 33
3 0
2 0
1 11
0 11
Name: value, dtype: int64
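Reversing the cumulative sums back to the original order and comparing with ne(0) then yields the final boolean mask; the values below follow directly from the cumulative sums shown above:

print (df['value'].iloc[::-1].groupby(df['proj']).cumsum().iloc[::-1].ne(0))
0     True
1     True
2    False
3    False
4     True
5     True
6     True
7    False
Name: value, dtype: bool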
Performance:
You need to avoid filtering per group, because it is slow:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
df = pd.DataFrame({'proj': np.random.randint(10000, size=N),
                   'value': np.random.choice([0,1,2], N)}).sort_values('proj')
print (df)
In [76]: %timeit df.groupby('proj').apply(lambda grp: grp[grp.value[::-1].cumsum().gt(0)[::-1]]).reset_index(level=0, drop=True)
20.2 s ± 4.04 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [78]: %timeit df[df['value'].iloc[::-1].groupby(df['proj']).cumsum().iloc[::-1].ne(0)]
268 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 2