Reputation: 13587
I have a dataset in this format:
and it needs to be grouped by DocumentId
and PersonId
columns and sorted by StartDate
. Which I doing it like this:
df = pd.read_csv(path).sort_values(by=["StartDate"]).groupby(["DocumentId", "PersonId"])
Now if there is row in this group by with DocumentCode RT
and EndDate not empty, all other rows need to be filled by that end date. So this result dataset should be following:
I could not figure out a way to do that. I think I can iterate over each groupby subset but how will find the value from the end date and replace it for each row in that subset.
Based on the suggestions to use bfill()
. I tried putting it as following:
df["EndDate"] = (
df.sort_values(by=["StartDate"])
.groupby(["DocumentId", "PersonId"])["EndDate"]
.bfill()
)
Above works fine but how can I add the condition for DocumentCode being RT
?
Upvotes: 0
Views: 254
Reputation: 5597
You could find the empty cells and replace with np.nan
, then fillna
with method='bfill'
df['EndDate'] = df['EndDate'].apply(lambda x: np.nan if x=='' else x)
df['EndDate'].fillna(method = 'bfill', inplace=True)
Alternatively you could iterate through the df from last row to first row, and fill in the EndDate
where necessary:
d = df.loc[df.shape[0]-1, 'EndDate'] #initial condition
for i in range(df.shape[0]-1, -1, -1):
if df.loc[i, 'DocumentCode'] == 'RT':
d = df.loc[i, 'EndDate']
else:
df.loc[i, 'EndDate'] = d
Upvotes: 1
Reputation: 2918
You can calculate the value to use to fill nan inside the apply function.
def fill_end_date(df):
rt_doc = df[df["DocumentCode"] == "RT"]
# if there is row in this group by with DocumentCode RT
if not rt_doc.empty:
end_date = rt_doc.iloc[0]["EndDate"]
# and EndDate not empty
if pd.notnull(end_date):
# all other rows need to be filled by that end date
df = df.fillna({"EndDate": end_date})
return df
df = pd.read_csv(path).sort_values(by=["StartDate"])
df.groupby(["DocumentId", "PersonId"]).apply(fill_end_date).reset_index(drop=True)
Upvotes: 2