Reputation: 2122
I have a dataframe like this:
timestamp | value | id. |
---|---|---|
2020-12-16 | 25 | 1 |
2020-12-17 | 45 | 1 |
2020-12-31 | 40 | 1 |
2021-01-31 | 37 | 1 |
2020-12-15 | 12 | 2 |
2020-12-16 | 78. | 2. |
I want to create a new column outcome
which takes a value yes
is the id
doesn't have any entry for the last 25 days.
For e.g., this is the expected output
timestamp | value | id. | outcome |
---|---|---|---|
2020-12-16 | 25 | 1 | yes |
2020-12-17 | 45 | 1 | yes. |
2020-12-31 | 40 | 1 | yes. |
2021-01-31 | 37 | 1 | no. |
2020-12-15 | 12 | 2 | yes. |
2020-12-16 | 78. | 2. | yes. |
In the above example, we can assume that the start date is 2020-12-15
. So, our comparisons start after this date for all the id
s.
For id 1
, starting from 2020-12-15
, it has entries in the last 25 days
except for 2021-01-31
. The entry that happened before 2021-01-31
is on 2020-12-31
which is > 25 days
in this example. So, the outcome
variable takes a value no
. For the rest of the rows, it is yes
because for each of the dates, there's a prior entry that falls within the 25 day difference.
I am very confused about how to write an algorithm for this.
I would have to groupby
id
, order by
timestamp
and then do rolling
something on the timestamp
to create the outcome
column?
I am very confused about this and any help will be much appreciated!
Thanks in advance!
Upvotes: 2
Views: 344
Reputation: 26686
Apply conditions IF;
1.the difference between consecutive days in a group is less than 15 OR 2. The first item in a group are equal to yes. Any other elements not meeting any of the conditions equal no.
import numpy as np
df['outcome']=np.where(df.groupby('id.')['timestamp'].transform('diff').dt.days.lt(15)|~df['id.'].duplicated(keep='first'),'yes','no')
timestamp value id. outcome
0 2020-12-16 25.0 1.0 yes
1 2020-12-17 45.0 1.0 yes
2 2020-12-31 40.0 1.0 yes
3 2021-01-31 37.0 1.0 no
4 2020-12-15 12.0 2.0 yes
5 2020-12-16 78.0 2.0 yes
Upvotes: 0
Reputation: 59579
Because your calculation requires sorting we can avoid grouping. Sort, take a row-difference and use where
to NaN
the values that cross groups (i.e. the earliest row for every ID). Because you want the first difference to be relative to '2020-12-15'
we can use fillna
to find the difference from that date and use np.where
to assign your strings values based on your condition.
import pandas as pd
import numpy as np
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.sort_values(['id', 'timestamp'])
s = (df['timestamp'].diff()
.where(df['id'].eq(df['id'].shift()))
.fillna(df['timestamp'] - pd.to_datetime('2020-12-15')))
#0 1 days
#1 1 days
#2 14 days
#3 31 days
#4 0 days
#5 1 days
df['outcome'] = np.where(s <= pd.Timedelta(25, 'D'), 'yes', 'no')
# timestamp value id outcome
#0 2020-12-16 25 1 yes
#1 2020-12-17 45 1 yes
#2 2020-12-31 40 1 yes
#3 2021-01-31 37 1 no
#4 2020-12-15 12 2 yes
#5 2020-12-16 78 2 yes
Upvotes: 2