Reputation: 13587
I am working with this dataframe where it has to be grouped by DocumentId
and PersonId
. Within that group, if the End Date
column is empty then fill it with the row where DocCode is RT.
DocumentID,PersonID,DocCode,StartDate,Amount,EndDate
120303,110001,FB,5/18/21,245,
120303,110001,TW,5/25/21,460,
120303,110001,RT,6/1/21,,6/6/21
120303,110011,GK,4/1/21,0,
120303,110011,AK,4/8/21,128,
120303,110011,PL,4/12/21,128,
120303,110011,FB,4/16/21,256,
120303,110011,RT,4/28/21,,5/4/21
It works fine but there is another twist to it. Within that group of DocumentId & PersonID
if the amount changes, the next amount's StartDate
will be the previous amount's EndDate
. So intermediate dataframe will look like this:
And then all the rows with duplicate amounts within that group and empty amounts will be collapsed into 1 row.
Final dataset will look like this:
Here is the code I am using to fill up all the empty EndDate
columns from the row where DocCode
is RT
:
df = pd.read_csv(path).sort_values(by=["StartDate"])
df.groupby(["DocumentId", "PersonId"]).apply(fill_end_date).reset_index(drop=True)
def fill_end_date(df):
rt_doc = df[df["DocumentCode"] == "RT"]
# if there is row in this group by with DocumentCode RT
if not rt_doc.empty:
end_date = rt_doc.iloc[0]["EndDate"]
# and EndDate not empty
if pd.notnull(end_date):
# all other rows need to be filled by that end date
df = df.fillna({"EndDate": end_date})
return df
Upvotes: 0
Views: 207
Reputation: 1704
First forward-fill your Amount column values to fill in all blanks. This is done to find when Amount changes. Shift the StartDate values back by 1
, which will be used to fill into EndDate column wherever Amount change is detected:
df.Amount.ffill(inplace=True)
df['StartDateShift'] = df['StartDate'].shift(-1)
>>> df
DocumentID PersonID DocCode StartDate Amount EndDate StartDateShift
0 120303 110001 FB 5/18/21 245.0 NaN 5/25/21
1 120303 110001 TW 5/25/21 460.0 NaN 6/1/21
2 120303 110001 RT 6/1/21 460.0 6/6/21 4/1/21
3 120303 110011 GK 4/1/21 0.0 NaN 4/8/21
4 120303 110011 AK 4/8/21 128.0 NaN 4/12/21
5 120303 110011 PL 4/12/21 128.0 NaN 4/16/21
6 120303 110011 FB 4/16/21 256.0 NaN 4/28/21
7 120303 110011 RT 4/28/21 256.0 5/4/21 NaN
Now get all those rows where Amount changes, and create a mask from it. Use this mask to fill into NaNs
in EndDate with values from StartDateShift:
mask = (df['Amount'] != df['Amount'].shift(-1))
df['EndDate'].fillna(df[mask]['StartDateShift'], inplace=True)
>>> df
DocumentID PersonID DocCode StartDate Amount EndDate StartDateShift
0 120303 110001 FB 5/18/21 245.0 5/25/21 5/25/21
1 120303 110001 TW 5/25/21 460.0 NaN 6/1/21
2 120303 110001 RT 6/1/21 460.0 6/6/21 4/1/21
3 120303 110011 GK 4/1/21 0.0 4/8/21 4/8/21
4 120303 110011 AK 4/8/21 128.0 NaN 4/12/21
5 120303 110011 PL 4/12/21 128.0 4/16/21 4/16/21
6 120303 110011 FB 4/16/21 256.0 NaN 4/28/21
7 120303 110011 RT 4/28/21 256.0 5/4/21 NaN
Back-fill the EndDate values, and then drop duplicates
in Amount column (collapsing empty ones into one row). Also now drop the StartDateShift column, and you will get your final dataframe:
df['EndDate'].bfill(inplace=True)
df = df.drop_duplicates('Amount', keep='first').drop('StartDateShift', axis=1)
>>> df
DocumentID PersonID DocCode StartDate Amount EndDate
0 120303 110001 FB 5/18/21 245.0 5/25/21
1 120303 110001 TW 5/25/21 460.0 6/6/21
3 120303 110011 GK 4/1/21 0.0 4/8/21
4 120303 110011 AK 4/8/21 128.0 4/16/21
6 120303 110011 FB 4/16/21 256.0 5/4/21
In above steps, I didn't groupby
DocumentId and PersonId columns as the logic worked on your sample df without it. But for your actual use-case, you can apply groupby
on DocumentId and PersonId columns and perform all steps above for each group, then concat
in the end:
dfs = []
for i,dfg in df.groupby(['DocumentID','PersonID']):
dfg.Amount.ffill(inplace=True)
dfg['StartDateShift'] = dfg['StartDate'].shift(-1)
dfg['EndDate'].fillna(dfg[~(dfg['Amount']==dfg['Amount'].shift(-1))]['StartDateShift'], inplace=True)
dfg['EndDate'].bfill(inplace=True)
dfg = dfg.drop_duplicates('Amount', keep='first').drop('StartDateShift', axis=1)
dfs.append(dfg)
final_df = pd.concat(dfs)
>>> final_df
DocumentID PersonID DocCode StartDate Amount EndDate
0 120303 110001 FB 5/18/21 245.0 5/25/21
1 120303 110001 TW 5/25/21 460.0 6/6/21
3 120303 110011 GK 4/1/21 0.0 4/8/21
4 120303 110011 AK 4/8/21 128.0 4/16/21
6 120303 110011 FB 4/16/21 256.0 5/4/21
Upvotes: 1