Reputation: 147
Each record (name) has a date and a status (Begin/Processing/Finished). How can I get the date of the Begin status for each row? Thank you.
date name status
0 2020-10-01 name_01 Begin
1 2020-10-02 name_02 Begin
2 2020-10-03 name_01 Processing
3 2020-10-04 name_03 Begin
4 2020-10-05 name_02 Processing
5 2020-10-06 name_01 Finished
6 2020-10-07 name_02 Finished
7 2020-10-08 name_03 Processing
8 2020-10-09 name_03 Finished
I need this:
date name status begin_at
0 2020-10-01 name_01 Begin 2020-10-01
1 2020-10-02 name_02 Begin 2020-10-02
2 2020-10-03 name_01 Processing 2020-10-01
3 2020-10-04 name_03 Begin 2020-10-04
4 2020-10-05 name_02 Processing 2020-10-02
5 2020-10-06 name_01 Finished 2020-10-01
6 2020-10-07 name_02 Finished 2020-10-02
7 2020-10-08 name_03 Processing 2020-10-04
8 2020-10-09 name_03 Finished 2020-10-04
Sorry, I did not mention that a name can restart its cycle. For example, name_01 appears with the "Begin" status again; see rows 9 and 10.
Like this
date name status begin_at
0 2020-10-01 name_01 Begin 2020-10-01
1 2020-10-02 name_02 Begin 2020-10-02
2 2020-10-03 name_01 Processing 2020-10-01
3 2020-10-04 name_03 Begin 2020-10-04
4 2020-10-05 name_02 Processing 2020-10-02
5 2020-10-06 name_01 Finished 2020-10-01
6 2020-10-07 name_02 Finished 2020-10-02
7 2020-10-08 name_03 Processing 2020-10-04
8 2020-10-09 name_03 Finished 2020-10-04
9 2020-10-10 name_01 Begin 2020-10-10
10 2020-10-11 name_01 Processing 2020-10-10
Therefore, it is not enough to find the single "Begin" row for each name. I need the date of the latest "Begin" record of the same name at or before each row.
Sorry for my poor English expression.
Sample Data
date name status
0 2020-10-01 name_01 Begin
1 2020-10-02 name_02 Begin
2 2020-10-03 name_01 Processing
3 2020-10-05 name_02 Processing
4 2020-10-06 name_03 Begin
5 2020-10-07 name_01 Finished
6 2020-10-08 name_02 Finished
7 2020-10-09 name_03 Processing
8 2020-10-10 name_03 Finished
9 2020-10-11 name_01 Begin
10 2020-10-12 name_01 Processing
11 2020-10-13 name_02 Begin
12 2020-10-14 name_02 Processing
13 2020-10-15 name_02 Finished
14 2020-10-16 name_01 Finished
Expected result
date name status begin_at
0 2020-10-01 name_01 Begin 2020-10-01
1 2020-10-02 name_02 Begin 2020-10-02
2 2020-10-03 name_01 Processing 2020-10-01
3  2020-10-05  name_02  Processing  2020-10-02
4 2020-10-06 name_03 Begin 2020-10-06
5 2020-10-07 name_01 Finished 2020-10-01
6  2020-10-08  name_02  Finished    2020-10-02
7 2020-10-09 name_03 Processing 2020-10-06
8 2020-10-10 name_03 Finished 2020-10-06
9 2020-10-11 name_01 Begin 2020-10-11
10 2020-10-12 name_01 Processing 2020-10-11
11 2020-10-13 name_02 Begin 2020-10-13
12 2020-10-14 name_02 Processing 2020-10-13
13 2020-10-15 name_02 Finished 2020-10-13
14 2020-10-16 name_01 Finished 2020-10-11
I tried to run this code:
df['begin_at'] = df.groupby('name').apply(lambda grp:
grp.groupby((grp.status == 'Begin').cumsum(), as_index=False)
.date.transform('first'))
but it gave:
date name status begin_at
0 2020-10-01 name_01 Begin 2020-10-11
1 2020-10-02 name_02 Begin 2020-10-13
2 2020-10-03 name_01 Processing 2020-10-11
3 2020-10-05 name_02 Processing 2020-10-13
4 2020-10-06 name_03 Begin NaT
5 2020-10-07 name_01 Finished 2020-10-11
6 2020-10-08 name_02 Finished 2020-10-13
7 2020-10-09 name_03 Processing NaT
8 2020-10-10 name_03 Finished NaT
9 2020-10-11 name_01 Begin NaT
10 2020-10-12 name_01 Processing NaT
11 2020-10-13 name_02 Begin NaT
12 2020-10-14 name_02 Processing NaT
13 2020-10-15 name_02 Finished NaT
14 2020-10-16 name_01 Finished NaT
Here is the whole code:
import numpy as np
import pandas as pd
df = pd.DataFrame([
["2020-10-01", "name_01", "Begin"],
["2020-10-02", "name_02", "Begin"],
["2020-10-03", "name_01", "Processing"],
["2020-10-05", "name_02", "Processing"],
["2020-10-06", "name_03", "Begin"],
["2020-10-07", "name_01", "Finished"],
["2020-10-08", "name_02", "Finished"],
["2020-10-09", "name_03", "Processing"],
["2020-10-10", "name_03", "Finished"],
["2020-10-11", "name_01", "Begin"],
["2020-10-12", "name_01", "Processing"],
["2020-10-13", "name_02", "Begin"],
["2020-10-14", "name_02", "Processing"],
["2020-10-15", "name_02", "Finished"],
["2020-10-16", "name_01", "Finished"],
], columns=["date", "name", "status"])
df['date'] = pd.to_datetime(df.date)
df = df.sort_values(by="date")
print(df)
df['begin_at'] = df.groupby('name').apply(lambda grp:
grp.groupby(
(grp.status == 'Begin').cumsum(), as_index=False)
.date.transform('first'))
print(df)
Upvotes: 1
Views: 939
Reputation: 30981
Create an auxiliary Series:
begin_at = df[df.status == 'Begin'].set_index('name').date.rename('begin_at')
Then join your DataFrame with it:
result = df.join(begin_at, on='name')
The result is:
date name status begin_at
0 2020-10-01 name_01 Begin 2020-10-01
1 2020-10-02 name_02 Begin 2020-10-02
2 2020-10-03 name_01 Processing 2020-10-01
3 2020-10-04 name_03 Begin 2020-10-04
4 2020-10-05 name_02 Processing 2020-10-02
5 2020-10-06 name_01 Finished 2020-10-01
6 2020-10-07 name_02 Finished 2020-10-02
7 2020-10-08 name_03 Processing 2020-10-04
8 2020-10-09 name_03 Finished 2020-10-04
Or, if you don't need the original DataFrame any more, save the result back under df.
Your post contained only a single cycle of Begin, Processing and Finished events for the same name. But if there are multiple such cycles (at least for one name), a different approach is needed:
df['begin_at'] = df.groupby('name').apply(lambda grp: grp.groupby(
(grp.status == 'Begin').cumsum()).date.transform('first'))\
.reset_index(level=0, drop=True)
It consists of a two-level grouping: the outer level groups by name, and the inner level groups by the cumulative count of "Begin" rows, so each "Begin" starts a new group.
Then, within each second-level group, the first date is broadcast to all member rows.
An additional step is to drop the top level of the MultiIndex added by the grouping. Initially I attempted to avoid this extra index level by passing as_index=False, but apparently that arrangement sometimes fails.
And the whole result is saved under the new column.
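To see what the inner grouping key looks like, here is a small illustration using just name_01's rows from the sample data above (the counter and the broadcast step, nothing else):

```python
import pandas as pd

# name_01's rows from the sample data, two Begin/Processing/Finished cycles
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-10-01", "2020-10-03", "2020-10-07",
        "2020-10-11", "2020-10-12", "2020-10-16",
    ]),
    "status": ["Begin", "Processing", "Finished",
               "Begin", "Processing", "Finished"],
})

# Each "Begin" bumps the counter, so rows of the same cycle share a key
cycle = (df.status == "Begin").cumsum()
print(cycle.tolist())  # [1, 1, 1, 2, 2, 2]

# transform('first') then broadcasts the first date of each cycle to its rows
print(df.groupby(cycle).date.transform("first").dt.strftime("%Y-%m-%d").tolist())
# ['2020-10-01', '2020-10-01', '2020-10-01', '2020-10-11', '2020-10-11', '2020-10-11']
```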
I have found a shorter and simpler solution.
Create an auxiliary Series with begin dates only:
begin_at = df[df.status == 'Begin'].set_index('name').date.rename('begin_at')
The result is:
name
name_01 2020-10-01
name_02 2020-10-02
name_03 2020-10-06
name_01 2020-10-11
name_02 2020-10-13
Name: begin_at, dtype: datetime64[ns]
Then merge (the "asof" version); reset_index turns the name index back into a column, so that by='name' can find it:
result = pd.merge_asof(df, begin_at.reset_index(), by='name', left_on='date', right_on='begin_at')
This single operation effectively performs two steps at once: it matches rows by name and, within each name, picks the last begin date not exceeding the row's date.
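Conceptually, the asof merge here is equivalent to keeping the date only on "Begin" rows and then forward-filling within each name. A minimal sketch of that equivalence (a hypothetical reformulation, not part of the answer's code; it assumes df is already sorted by date):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2020-10-01", "2020-10-02", "2020-10-03",
                            "2020-10-11", "2020-10-12"]),
    "name": ["name_01", "name_02", "name_01", "name_01", "name_01"],
    "status": ["Begin", "Begin", "Processing", "Begin", "Processing"],
})

# Step 1: keep the date only where status is Begin, NaT elsewhere
# Step 2: forward-fill that value within each name
df["begin_at"] = df.date.where(df.status == "Begin").groupby(df.name).ffill()
print(df.begin_at.dt.strftime("%Y-%m-%d").tolist())
# ['2020-10-01', '2020-10-02', '2020-10-01', '2020-10-11', '2020-10-11']
```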
Use %timeit to check the execution time of each variant on a bigger source data sample. I suppose that this last variant will run faster than my earlier ones.
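Outside IPython, the standard-library timeit gives the same kind of comparison. A sketch on a larger synthetic frame (the sizes, names, and Begin frequency here are made up for illustration):

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=n, freq="h"),
    "name": rng.choice([f"name_{i:02d}" for i in range(20)], size=n),
})
# Roughly every tenth row starts a new cycle; force each name's
# earliest row to be "Begin" so every row has a begin date
df["status"] = np.where(rng.random(n) < 0.1, "Begin", "Processing")
df.loc[df.groupby("name").head(1).index, "status"] = "Begin"

def via_groupby_apply():
    out = df.groupby("name").apply(lambda grp: grp.groupby(
        (grp.status == "Begin").cumsum()).date.transform("first"))
    return out.reset_index(level=0, drop=True).sort_index()

def via_merge_asof():
    begin_at = df.loc[df.status == "Begin", ["name", "date"]] \
                 .rename(columns={"date": "begin_at"})
    return pd.merge_asof(df, begin_at, by="name",
                         left_on="date", right_on="begin_at").begin_at

# Sanity check: both variants agree on every row
a = via_groupby_apply().to_numpy()
b = via_merge_asof().to_numpy()
assert (a == b).all()

for fn in (via_groupby_apply, via_merge_asof):
    print(fn.__name__, timeit.timeit(fn, number=3))
```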
Upvotes: 1
Reputation: 592
Assuming the Begin date is always <= the Processing or Finished date:
>>> df.assign(begin_at=df.groupby('name').date.transform(min))
date name status begin_at
0 2020-10-01 name_01 Begin 2020-10-01
1 2020-10-02 name_02 Begin 2020-10-02
2 2020-10-03 name_01 Processing 2020-10-01
3 2020-10-04 name_03 Begin 2020-10-04
4 2020-10-05 name_02 Processing 2020-10-02
5 2020-10-06 name_01 Finished 2020-10-01
6 2020-10-07 name_02 Finished 2020-10-02
7 2020-10-08 name_03 Processing 2020-10-04
8 2020-10-09 name_03 Finished 2020-10-04
Upvotes: 2
Reputation: 25239
Take advantage of the alphabetical order of Begin, Finished, Processing ("Begin" sorts first). Use sort_values and groupby with transform('first'):
df['begin_at'] = df.sort_values('status').groupby('name').date.transform('first')
date name status begin_at
0 2020-10-01 name_01 Begin 2020-10-01
1 2020-10-02 name_02 Begin 2020-10-02
2 2020-10-03 name_01 Processing 2020-10-01
3 2020-10-04 name_03 Begin 2020-10-04
4 2020-10-05 name_02 Processing 2020-10-02
5 2020-10-06 name_01 Finished 2020-10-01
6 2020-10-07 name_02 Finished 2020-10-02
7 2020-10-08 name_03 Processing 2020-10-04
8 2020-10-09 name_03 Finished 2020-10-04
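The trick above relies on the lucky accident that "Begin" sorts alphabetically before "Finished" and "Processing". If the labels didn't sort that way, an ordered Categorical would make the intent explicit; a sketch (not part of the original answer, and it shares the same single-cycle-per-name limitation):

```python
import pandas as pd

# One Begin/Processing/Finished cycle for a single name
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-10-01", "2020-10-03", "2020-10-06"]),
    "name": ["name_01", "name_01", "name_01"],
    "status": ["Begin", "Processing", "Finished"],
})

# Declare the lifecycle order explicitly instead of relying on the alphabet
df["status"] = pd.Categorical(df.status,
                              categories=["Begin", "Processing", "Finished"],
                              ordered=True)
# sort_values now puts Begin rows first within each name by definition
df["begin_at"] = df.sort_values("status").groupby("name").date.transform("first")
print(df.begin_at.dt.strftime("%Y-%m-%d").tolist())
# ['2020-10-01', '2020-10-01', '2020-10-01']
```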
Upvotes: 2