Reputation: 7503
I have a dataframe as below
*------------------------------------------------------------*
| started act_id from_state to_state|
*------------------------------------------------------------*
|2019-11-06 05:49:39.571392 2 CREATED ENABLED |
|2019-11-25 22:20:59.150339 2 ENABLED DISABLED |
|2019-11-26 10:22:36.571392 2 DISABLED ENABLED |
|2019-11-14 14:57:02.571392 3 CREATED ENABLED |
|2019-12-06 16:03:44.255603 3 ENABLED DISABLED |
|2019-12-12 12:50:48.571392 3 DISABLED ENABLED |
*------------------------------------------------------------*
I want to calculate total time in days by act_id
to show how long that act_id
stayed in to_state
. So how long act_id
in status ENABLED or DISABLED before it changed it state from ENABLED to DISABLED?
here is my code
import pandas as pd
import numpy as np
df = pd.read_csv('transitions.csv', index_col=0)
df['started'] = pd.to_datetime(df['started'])
df['total_time'] = 0
df['total_time'] = df.groupby(['account_id', 'from_state', 'to_state'])['started'].diff()/np.timedelta64(1, 'D')
df
but when it is giving me output in my new field total_time
as NaN
instead of showing in days
*------------------------------------------------------------------------------*
| started act_id from_state to_state total_time |
*------------------------------------------------------------------------------*
|2019-11-06 05:49:39.571392 2 CREATED ENABLED NaN |
|2019-11-25 22:20:59.150339 2 ENABLED DISABLED NaN |
|2019-11-26 10:22:36.571392 2 DISABLED ENABLED NaN |
|2019-11-14 14:57:02.571392 3 CREATED ENABLED NaN |
|2019-12-06 16:03:44.255603 3 ENABLED DISABLED NaN |
|2019-12-12 12:50:48.571392 3 DISABLED ENABLED NaN |
*------------------------------------------------------------------------------*
I want my expected output as
*------------------------------------------------------------------------------*
| started act_id from_state to_state total_time |
*------------------------------------------------------------------------------*
|2019-11-06 05:49:39.571392 2 CREATED ENABLED 0 |
|2019-11-25 22:20:59.150339 2 ENABLED DISABLED 19 |
|2019-11-26 10:22:36.571392 2 DISABLED ENABLED 1 |
|2019-11-14 14:57:02.571392 3 CREATED ENABLED 0 |
|2019-12-06 16:03:44.255603 3 ENABLED DISABLED 22 |
|2019-12-12 12:50:48.571392 3 DISABLED ENABLED 6 |
*------------------------------------------------------------------------------*
where I am doing wrong?
Upvotes: 1
Views: 708
Reputation: 11
df['started'] = pd.to_datetime(df['started'])
df = df.merge(pd.DataFrame( pd.DataFrame( df.groupby(['act_id', 'from_state', 'to_state']).count())), how='outer', indicator=False, on=['act_id', 'from_state', 'to_state'] )
You might need to rename your dataframe after merge accordingly. Hope this would give you the answer
Upvotes: 1
Reputation: 862751
I think problem here if grouping by all 3 columns each group contains only one row, so difference is always NaT
.
But if grouping by ID
only:
df['started'] = pd.to_datetime(df['started'])
df['total_time'] = (df.groupby('act_id')['started'].diff()/np.timedelta64(1, 'D')).fillna(0)
print (df)
started act_id from_state to_state total_time
0 2019-11-06 05:49:39.571392 2 CREATED ENABLED 0.000000
1 2019-11-25 22:20:59.150339 2 ENABLED DISABLED 19.688421
2 2019-11-26 10:22:36.571392 2 DISABLED ENABLED 0.501128
3 2019-11-14 14:57:02.571392 3 CREATED ENABLED 0.000000
4 2019-12-06 16:03:44.255603 3 ENABLED DISABLED 22.046316
5 2019-12-12 12:50:48.571392 3 DISABLED ENABLED 5.866022
If need test also from
and to
state is possible shift
column to_state
per ID
, first values replace by from_state
and compare both columns if equal, then mask pass to last line of code:
df['started'] = pd.to_datetime(df['started'])
df['to_state1'] = df.groupby('act_id')['to_state'].shift().fillna(df['from_state'])
print (df)
started act_id from_state to_state to_state1
0 2019-11-06 05:49:39.571392 2 CREATED ENABLED CREATED
1 2019-11-25 22:20:59.150339 2 ENABLED DISABLED ENABLED
2 2019-11-26 10:22:36.571392 2 DISABLED ENABLED DISABLED
3 2019-11-14 14:57:02.571392 3 CREATED ENABLED CREATED
4 2019-12-06 16:03:44.255603 3 ENABLED DISABLED ENABLED
5 2019-12-12 12:50:48.571392 3 DISABLED ENABLED DISABLED
m = df['from_state'].eq(df['to_state1'])
print (m)
0 True
1 True
2 True
3 True
4 True
5 True
dtype: bool
df['total_time'] = (df[m].groupby('act_id')['started'].diff()/np.timedelta64(1, 'D')).fillna(0)
print (df)
started act_id from_state to_state to_state1 \
0 2019-11-06 05:49:39.571392 2 CREATED ENABLED CREATED
1 2019-11-25 22:20:59.150339 2 ENABLED DISABLED ENABLED
2 2019-11-26 10:22:36.571392 2 DISABLED ENABLED DISABLED
3 2019-11-14 14:57:02.571392 3 CREATED ENABLED CREATED
4 2019-12-06 16:03:44.255603 3 ENABLED DISABLED ENABLED
5 2019-12-12 12:50:48.571392 3 DISABLED ENABLED DISABLED
total_time
0 0.000000
1 19.688421
2 0.501128
3 0.000000
4 22.046316
5 5.866022
Upvotes: 1