Pandas: Replace empty column values with the non-empty value based on a condition

Question

I have a dataset in this format:

It needs to be grouped by the DocumentId and PersonId columns and sorted by StartDate, which I am doing like this:
df = pd.read_csv(path).sort_values(by=["StartDate"]).groupby(["DocumentId", "PersonId"])

Now, if a group contains a row with DocumentCode RT and a non-empty EndDate, all other rows in that group need to be filled with that end date. So the resulting dataset should be the following:

I could not figure out a way to do that. I think I can iterate over each groupby subset, but how do I find the end date value and apply it to every row in that subset?

Based on the suggestions to use bfill(), I tried the following:

df["EndDate"] = (
    df.sort_values(by=["StartDate"])
    .groupby(["DocumentId", "PersonId"])["EndDate"]
    .bfill()
)

The above works fine, but how can I add the condition that DocumentCode is RT?

Alexander Volkovsky · Accepted Answer

You can compute the fill value inside a function passed to groupby(...).apply(), which receives each group as its own DataFrame:

import pandas as pd

def fill_end_date(group):
    # rows of this (DocumentId, PersonId) group whose DocumentCode is RT
    rt_doc = group[group["DocumentCode"] == "RT"]
    # if there is a row in this group with DocumentCode RT
    if not rt_doc.empty:
        end_date = rt_doc.iloc[0]["EndDate"]
        # and its EndDate is not empty
        if pd.notnull(end_date):
            # fill the empty EndDate values of the other rows with that end date
            group = group.fillna({"EndDate": end_date})

    return group

df = pd.read_csv(path).sort_values(by=["StartDate"])
# apply returns a new DataFrame, so assign the result back
df = df.groupby(["DocumentId", "PersonId"]).apply(fill_end_date).reset_index(drop=True)
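As an alternative, here is a minimal vectorized sketch of the same idea that avoids groupby(...).apply(), whose handling of the grouping columns has changed between pandas versions. The function name fill_end_date_vectorized and the sample rows are hypothetical, made up for illustration because the question's data is not shown. One difference to be aware of: this version uses the first non-empty EndDate among a group's RT rows, whereas the function above only looks at the first RT row.

```python
import pandas as pd

def fill_end_date_vectorized(df):
    """Fill empty EndDate values with the group's RT EndDate (hypothetical helper)."""
    out = df.copy()
    # keep EndDate only on RT rows; every other row becomes NaN
    rt_end = out["EndDate"].where(out["DocumentCode"] == "RT")
    # broadcast the first non-empty RT EndDate to every row of its group;
    # groups without such a value get NaN and are therefore left untouched
    group_end = rt_end.groupby([out["DocumentId"], out["PersonId"]]).transform("first")
    # fill only the rows whose EndDate is currently empty
    out["EndDate"] = out["EndDate"].fillna(group_end)
    return out

# made-up sample data, not the dataset from the question
sample = pd.DataFrame({
    "DocumentId": [1, 1, 1, 2, 2],
    "PersonId": [10, 10, 10, 20, 20],
    "DocumentCode": ["AB", "RT", "CD", "AB", "CD"],
    "StartDate": ["2020-01-01", "2020-02-01", "2020-03-01", "2020-01-01", "2020-02-01"],
    "EndDate": [None, "2020-12-31", None, None, "2020-06-30"],
})

result = fill_end_date_vectorized(sample.sort_values(by=["StartDate"]))
print(result.sort_index())
```

In this sample, the first group has an RT row with an end date, so its empty EndDate values are filled; the second group has no RT row, so its empty value stays empty.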
