Himanshu Yadav
Himanshu Yadav

Reputation: 13587

Pandas: Replace empty column values with the non-empty value based on a condition

I have a dataset in this format:
enter image description here

and it needs to be grouped by DocumentId and PersonId columns and sorted by StartDate. Which I doing it like this:
df = pd.read_csv(path).sort_values(by=["StartDate"]).groupby(["DocumentId", "PersonId"])

Now if there is row in this group by with DocumentCode RT and EndDate not empty, all other rows need to be filled by that end date. So this result dataset should be following:
enter image description here

I could not figure out a way to do that. I think I can iterate over each groupby subset but how will find the value from the end date and replace it for each row in that subset.

Based on the suggestions to use bfill(). I tried putting it as following:

df["EndDate"] = (
    df.sort_values(by=["StartDate"])
    .groupby(["DocumentId", "PersonId"])["EndDate"]
    .bfill()
)

Above works fine but how can I add the condition for DocumentCode being RT?

Upvotes: 0

Views: 254

Answers (2)

blackraven
blackraven

Reputation: 5597

You could find the empty cells and replace with np.nan, then fillna with method='bfill'

df['EndDate'] = df['EndDate'].apply(lambda x: np.nan if x=='' else x)
df['EndDate'].fillna(method = 'bfill', inplace=True)

Alternatively you could iterate through the df from last row to first row, and fill in the EndDate where necessary:

d = df.loc[df.shape[0]-1, 'EndDate']   #initial condition
for i in range(df.shape[0]-1, -1, -1):
    if df.loc[i, 'DocumentCode'] == 'RT':
        d = df.loc[i, 'EndDate']
    else:
        df.loc[i, 'EndDate'] = d

Upvotes: 1

Alexander Volkovsky
Alexander Volkovsky

Reputation: 2918

You can calculate the value to use to fill nan inside the apply function.

def fill_end_date(df):
    rt_doc = df[df["DocumentCode"] == "RT"]
    # if there is row in this group by with DocumentCode RT
    if not rt_doc.empty:
        end_date = rt_doc.iloc[0]["EndDate"]
        # and EndDate not empty
        if pd.notnull(end_date):
            # all other rows need to be filled by that end date
            df = df.fillna({"EndDate": end_date})

    return df

df = pd.read_csv(path).sort_values(by=["StartDate"])
df.groupby(["DocumentId", "PersonId"]).apply(fill_end_date).reset_index(drop=True)

Upvotes: 2

Related Questions