Reputation: 508
I have a dataset of assignments per case. I'm trying to fill the last_event of the most recently assigned attorney on each case with the cas_closed_date.
case_nb  attorney  cas_closed_date  assigned    last_event
1        A         2015-07-02       2015-07-02  2015-07-02
2        B         2015-09-10       2015-09-10  2015-09-10
3        C         2016-03-24       2016-03-24  2016-03-24
4        D         2018-06-07       2013-10-21  2014-02-06
4        C         2018-06-07       2013-09-13  2013-09-13
4        F         2018-06-07       2018-03-31  2018-04-05
(For replication:
import pandas as pd

df = pd.DataFrame({'assigned': {0: '2015-07-02', 1: '2015-09-10', 2: '2016-03-24', 3: '2013-10-21', 4: '2013-09-13', 5: '2018-03-31'},
                   'attorney': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'C', 5: 'F'},
                   'cas_closed_date': {0: '2015-07-02', 1: '2015-09-10', 2: '2016-03-24', 3: '2018-06-07', 4: '2018-06-07', 5: '2018-06-07'},
                   'case_nb': {0: 1, 1: 2, 2: 3, 3: 4, 4: 4, 5: 4},
                   'last_event': {0: '2015-07-02', 1: '2015-09-10', 2: '2016-03-24', 3: '2014-02-06', 4: '2013-09-13', 5: '2018-04-05'}})
# parse the date columns so comparisons behave like dates rather than strings
for col in ['assigned', 'cas_closed_date', 'last_event']:
    df[col] = pd.to_datetime(df[col])
)
My dataset includes the case number, attorney assigned, the case closed date, the date the attorney was assigned, and the last event the attorney appeared on. For the above example, I'd want the last line to be
4        F         2018-06-07       2018-03-31  2018-06-07
I've seen a few ways of filling in NAs based on data in the same column, such as this question. But those approaches use transform, which I can't get to work across multiple columns. What I have so far works with apply, but not transform:
def fixdate(gp):
    last_unasgn = gp.iloc[-1]['last_event']
    if gp.iloc[-1]['cas_closed_date'] > last_unasgn:
        return gp.iloc[-1]['cas_closed_date']
    else:
        return last_unasgn

asmt.groupby('evt_file_number').apply(lambda x: fixdate(x))
> 4 2018-06-07
Is there a way to use transform? If not, what's the best way to use that aggregate data to fill in my original dataset?
Upvotes: 1
Views: 157
Reputation: 13437
If the same case_nb is never assigned to two different attorneys on the same day, you can try this solution.
import numpy as np
import pandas as pd
df = pd.DataFrame({'assigned': {0: pd.Timestamp('2015-07-02'), 1: '2015-09-10',
2: '2016-03-24', 3: '2013-10-21', 4: '2013-09-13',
5: '2018-03-31'},
'attorney': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'C', 5: 'F'},
'cas_closed_date': {0: pd.Timestamp('2015-07-02'), 1: '2015-09-10',
2: '2016-03-24', 3: '2018-06-07', 4: '2018-06-07',
5: '2018-06-07'},
'case_nb': {0: 1, 1: 2, 2: 3, 3: 4, 4: 4, 5: 4},
'last_event': {0: pd.Timestamp('2015-07-02'), 1: '2015-09-10',
2: '2016-03-24', 3: '2014-02-06', 4: '2013-09-13',
5: '2018-04-05'}})
# latest assignment date per case
out = df.groupby("case_nb")["assigned"].max().reset_index()
out["last"] = True
# flag the row holding that latest assignment in each case
df = pd.merge(df, out, how="left", on=["case_nb", "assigned"])
# on the flagged rows, overwrite last_event with the close date
df["last_event"] = np.where(df["last"] == True, df["cas_closed_date"], df["last_event"])
del df["last"], out
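If you want a quick check of the result (just a usage note on the example df above, nothing beyond the sample data), only attorney F, the latest assignment on case 4, should end up with last_event equal to the close date:
# quick check: only attorney F (latest assignment on case 4) should now
# have last_event == 2018-06-07; the other case-4 rows are unchanged
print(df.loc[df["case_nb"] == 4, ["attorney", "assigned", "last_event"]])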
There is probably a more elegant solution, but at least this one is vectorized (see np.where) and doesn't use apply, which helps pandas performance.
EDIT: In case you want to use transform, there is this option, which seems to me to be the fastest solution.
df["last_assigned"] = df.groupby("case_nb")["assigned"].transform("max")
df["last_event"] = np.where(df["assigned"]==df["last_assigned"],
df["cas_closed_date"],
df["last_event"])
del df["last_assigned"]
Upvotes: 1
Reputation: 508
I ended up doing a combination of transform and just plain indexing.
df = df.sort_values(['case_nb', 'last_event'])
df['last_unassigned'] = df.groupby('case_nb')['last_event'].transform('last')
df.loc[(df['last_event'] == df['last_unassigned'])
       & (df['last_unassigned'] != df['cas_closed_date']), 'last_event'] = df['cas_closed_date']
Upvotes: 0
Reputation: 76
The .transform operation passes the columns of each group to the function one at a time, each as a one-dimensional Series, while the .apply operation passes each group to the function as a whole DataFrame. So .transform works on one Series (one column) at a time, which is why it can't combine cas_closed_date and last_event, whereas .apply sees the entire DataFrame (all columns) for the group. Hope this helps.
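To make that concrete, here is a minimal sketch you can run against the df from the question (show_transform and show_apply are just illustrative names; the apply version mirrors what fixdate does by combining two columns):
def show_transform(s):
    # transform hands the function one column of one group as a Series
    print("transform got a", type(s).__name__, "for column", s.name)
    return s

def show_apply(g):
    # apply hands the function the whole group as a DataFrame,
    # so it is free to combine several columns, like fixdate does
    print("apply got a", type(g).__name__, "with columns", list(g.columns))
    return max(g.iloc[-1]["cas_closed_date"], g.iloc[-1]["last_event"])

df.groupby("case_nb")[["assigned", "last_event"]].transform(show_transform)
df.groupby("case_nb").apply(show_apply)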
Upvotes: 1