Himanshu Yadav
Himanshu Yadav

Reputation: 13587

Pandas: Replace column values in a group by based on value from the next row

I am working with this dataframe where it has to be grouped by DocumentId and PersonId. Within that group, if the End Date column is empty then fill it with the row where DocCode is RT.enter image description here

DocumentID,PersonID,DocCode,StartDate,Amount,EndDate
120303,110001,FB,5/18/21,245,
120303,110001,TW,5/25/21,460,
120303,110001,RT,6/1/21,,6/6/21
120303,110011,GK,4/1/21,0,
120303,110011,AK,4/8/21,128,
120303,110011,PL,4/12/21,128,
120303,110011,FB,4/16/21,256,
120303,110011,RT,4/28/21,,5/4/21

It works fine but there is another twist to it. Within that group of DocumentId & PersonID if the amount changes, the next amount's StartDate will be the previous amount's EndDate. So intermediate dataframe will look like this:

enter image description here

And then all the rows with duplicate amounts within that group and empty amounts will be collapsed into 1 row.

Final dataset will look like this: enter image description here

Here is the code I am using to fill up all the empty EndDate columns from the row where DocCode is RT:

df = pd.read_csv(path).sort_values(by=["StartDate"])
df.groupby(["DocumentId", "PersonId"]).apply(fill_end_date).reset_index(drop=True)

def fill_end_date(df):
    rt_doc = df[df["DocumentCode"] == "RT"]
    # if there is row in this group by with DocumentCode RT
    if not rt_doc.empty:
        end_date = rt_doc.iloc[0]["EndDate"]
        # and EndDate not empty
        if pd.notnull(end_date):
            # all other rows need to be filled by that end date
            df = df.fillna({"EndDate": end_date})

    return df

Upvotes: 0

Views: 207

Answers (1)

Ank
Ank

Reputation: 1704

First forward-fill your Amount column values to fill in all blanks. This is done to find when Amount changes. Shift the StartDate values back by 1, which will be used to fill into EndDate column wherever Amount change is detected:

df.Amount.ffill(inplace=True)
df['StartDateShift'] = df['StartDate'].shift(-1)

>>> df
   DocumentID  PersonID DocCode StartDate  Amount EndDate StartDateShift
0      120303    110001      FB   5/18/21   245.0     NaN        5/25/21
1      120303    110001      TW   5/25/21   460.0     NaN         6/1/21
2      120303    110001      RT    6/1/21   460.0  6/6/21         4/1/21
3      120303    110011      GK    4/1/21     0.0     NaN         4/8/21
4      120303    110011      AK    4/8/21   128.0     NaN        4/12/21
5      120303    110011      PL   4/12/21   128.0     NaN        4/16/21
6      120303    110011      FB   4/16/21   256.0     NaN        4/28/21
7      120303    110011      RT   4/28/21   256.0  5/4/21            NaN

Now get all those rows where Amount changes, and create a mask from it. Use this mask to fill into NaNs in EndDate with values from StartDateShift:

mask = (df['Amount'] != df['Amount'].shift(-1))
df['EndDate'].fillna(df[mask]['StartDateShift'], inplace=True)

>>> df
   DocumentID  PersonID DocCode StartDate  Amount  EndDate StartDateShift
0      120303    110001      FB   5/18/21   245.0  5/25/21        5/25/21
1      120303    110001      TW   5/25/21   460.0      NaN         6/1/21
2      120303    110001      RT    6/1/21   460.0   6/6/21         4/1/21
3      120303    110011      GK    4/1/21     0.0   4/8/21         4/8/21
4      120303    110011      AK    4/8/21   128.0      NaN        4/12/21
5      120303    110011      PL   4/12/21   128.0  4/16/21        4/16/21
6      120303    110011      FB   4/16/21   256.0      NaN        4/28/21
7      120303    110011      RT   4/28/21   256.0   5/4/21            NaN

Back-fill the EndDate values, and then drop duplicates in Amount column (collapsing empty ones into one row). Also now drop the StartDateShift column, and you will get your final dataframe:

df['EndDate'].bfill(inplace=True)
df = df.drop_duplicates('Amount', keep='first').drop('StartDateShift', axis=1)

>>> df
   DocumentID  PersonID DocCode StartDate  Amount  EndDate
0      120303    110001      FB   5/18/21   245.0  5/25/21
1      120303    110001      TW   5/25/21   460.0   6/6/21
3      120303    110011      GK    4/1/21     0.0   4/8/21
4      120303    110011      AK    4/8/21   128.0  4/16/21
6      120303    110011      FB   4/16/21   256.0   5/4/21

In above steps, I didn't groupby DocumentId and PersonId columns as the logic worked on your sample df without it. But for your actual use-case, you can apply groupby on DocumentId and PersonId columns and perform all steps above for each group, then concat in the end:

dfs = []
for i,dfg in df.groupby(['DocumentID','PersonID']):
    dfg.Amount.ffill(inplace=True)
    dfg['StartDateShift'] = dfg['StartDate'].shift(-1)
    dfg['EndDate'].fillna(dfg[~(dfg['Amount']==dfg['Amount'].shift(-1))]['StartDateShift'], inplace=True)
    dfg['EndDate'].bfill(inplace=True)
    dfg = dfg.drop_duplicates('Amount', keep='first').drop('StartDateShift', axis=1)
    dfs.append(dfg)

final_df = pd.concat(dfs)

>>> final_df
   DocumentID  PersonID DocCode StartDate  Amount  EndDate
0      120303    110001      FB   5/18/21   245.0  5/25/21
1      120303    110001      TW   5/25/21   460.0   6/6/21
3      120303    110011      GK    4/1/21     0.0   4/8/21
4      120303    110011      AK    4/8/21   128.0  4/16/21
6      120303    110011      FB   4/16/21   256.0   5/4/21

Upvotes: 1

Related Questions