Reputation: 2773
My df looks like this:
sprint sprint_created
------ --------------
S100 2020-01-01
S101 2020-01-10
NULL 2020-01-20
NULL 2020-01-31
S101 2020-01-10
...
In the df above, you can see that some of the sprint values are NULL.
I have another df2 that has sprint date ranges:
sprint sprint_start sprint_end
------ ------------ ----------
S100 2020-01-01 2020-01-09
S101 2020-01-10 2020-01-19
S102 2020-01-20 2020-01-29
S103 2020-01-30 2020-02-09
S104 2020-02-10 2020-02-19
...
How can I map these data and fill in the NULL values in df by comparing against the date ranges in df2?
Please note that the shapes of df and df2 are different.
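For anyone wanting to reproduce this, the sample frames can be built like so (assuming the NULLs are stored as NaN/None):

```python
import pandas as pd

# Sample data from the question; NULL sprints represented as None
df = pd.DataFrame({
    'sprint': ['S100', 'S101', None, None, 'S101'],
    'sprint_created': pd.to_datetime(
        ['2020-01-01', '2020-01-10', '2020-01-20', '2020-01-31', '2020-01-10']),
})

df2 = pd.DataFrame({
    'sprint': ['S100', 'S101', 'S102', 'S103', 'S104'],
    'sprint_start': pd.to_datetime(
        ['2020-01-01', '2020-01-10', '2020-01-20', '2020-01-30', '2020-02-10']),
    'sprint_end': pd.to_datetime(
        ['2020-01-09', '2020-01-19', '2020-01-29', '2020-02-09', '2020-02-19']),
})
```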
Upvotes: 1
Views: 254
Reputation: 16683
One way would be to melt and resample your df2, then create a dictionary of dates to sprints and map it back onto df1:
import pandas as pd

# make sure the date columns are in datetime format
df1['sprint_created'] = pd.to_datetime(df1['sprint_created'])
df2['sprint_start'] = pd.to_datetime(df2['sprint_start'])
df2['sprint_end'] = pd.to_datetime(df2['sprint_end'])

# melt the two date columns into long format, then resample daily
# per sprint so every date between start and end gets a row
new = (df2.melt(id_vars='sprint').drop('variable', axis=1).set_index('value')
          .groupby('sprint', group_keys=False).resample('D').ffill().reset_index())

# build a date -> sprint dictionary and map it back onto df1
dct = dict(zip(new['value'], new['sprint']))
df1['sprint'] = df1['sprint_created'].map(dct)
# or, to fill only the missing values:
# df1['sprint'] = df1['sprint'].fillna(df1['sprint_created'].map(dct))
df1
Out[1]:
sprint sprint_created
0 S100 2020-01-01
1 S101 2020-01-10
2 S102 2020-01-20
3 S103 2020-01-31
4 S101 2020-01-10
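As a side note, a sketch of an alternative that avoids materializing every daily date: build a pd.IntervalIndex from the ranges and look each date up directly. This assumes the sprint ranges do not overlap (get_indexer requires that); the sample data is reconstructed inline so the block is runnable.

```python
import pandas as pd

# Sample frames from the question (assumption: NULL sprints stored as None)
df1 = pd.DataFrame({
    'sprint': ['S100', 'S101', None, None, 'S101'],
    'sprint_created': pd.to_datetime(
        ['2020-01-01', '2020-01-10', '2020-01-20', '2020-01-31', '2020-01-10']),
})
df2 = pd.DataFrame({
    'sprint': ['S100', 'S101', 'S102', 'S103'],
    'sprint_start': pd.to_datetime(
        ['2020-01-01', '2020-01-10', '2020-01-20', '2020-01-30']),
    'sprint_end': pd.to_datetime(
        ['2020-01-09', '2020-01-19', '2020-01-29', '2020-02-09']),
})

# one interval per sprint, closed on both ends
intervals = pd.IntervalIndex.from_arrays(
    df2['sprint_start'], df2['sprint_end'], closed='both')

# position of the interval containing each date (-1 if none matches)
pos = intervals.get_indexer(df1['sprint_created'])

# fill only the missing sprints; unmatched dates (pos == -1) stay NaN
df1['sprint'] = df1['sprint'].fillna(
    pd.Series(df2['sprint'].to_numpy()[pos], index=df1.index).where(pos >= 0))
```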
Upvotes: 1
Reputation: 26676
I assumed that duplicated sprint rows in df (the first dataframe) can be dropped; please advise if that is not the case. I use merge_asof with a one-day tolerance, based on my comparison of the two dataframes you provided.
df.assign(
    sprint=pd.merge_asof(
        df.drop_duplicates(keep='first'),
        df1,
        left_on="sprint_created",
        right_on="sprint_start",
        tolerance=pd.Timedelta("1D"),
    )['sprint_y']
).dropna()
sprint sprint_created
0 S100 2020-01-01
1 S101 2020-01-10
2 S102 2020-01-20
3 S103 2020-01-31
If your frame legitimately has multiple rows per sprint, as explained in the comments above, please try:
g = df.assign(
    sprint=pd.merge_asof(
        df.drop_duplicates(keep='first'),
        df1,
        left_on="sprint_created",
        right_on="sprint_start",
        tolerance=pd.Timedelta("1D"),
    )['sprint_y']
)
g.loc[g.sprint.isna(), 'sprint'] = g.groupby('sprint_created').sprint.ffill()
print(g)
sprint sprint_created
0 S100 2020-01-01
1 S101 2020-01-10
2 S102 2020-01-20
3 S103 2020-01-31
4 S101 2020-01-10
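A self-contained sketch of the approach above, with the sample data reconstructed; the ranges frame is named `ranges` here for clarity (this answer calls it df1, the question calls it df2). Note that merge_asof requires both sides sorted on the join keys; after dropping the duplicate row, the sample df happens to be sorted already.

```python
import pandas as pd

df = pd.DataFrame({
    'sprint': ['S100', 'S101', None, None, 'S101'],
    'sprint_created': pd.to_datetime(
        ['2020-01-01', '2020-01-10', '2020-01-20', '2020-01-31', '2020-01-10']),
})
ranges = pd.DataFrame({
    'sprint': ['S100', 'S101', 'S102', 'S103'],
    'sprint_start': pd.to_datetime(
        ['2020-01-01', '2020-01-10', '2020-01-20', '2020-01-30']),
    'sprint_end': pd.to_datetime(
        ['2020-01-09', '2020-01-19', '2020-01-29', '2020-02-09']),
})

# asof-match each created date to the nearest sprint_start at or before it,
# within a one-day tolerance; both 'sprint' columns get _x/_y suffixes
merged = pd.merge_asof(
    df.drop_duplicates(keep='first'),
    ranges,
    left_on='sprint_created',
    right_on='sprint_start',
    tolerance=pd.Timedelta('1D'),
)

# assign aligns on index, so the dropped duplicate row starts as NaN,
# then forward-fill within each sprint_created group to recover it
g = df.assign(sprint=merged['sprint_y'])
g.loc[g['sprint'].isna(), 'sprint'] = g.groupby('sprint_created')['sprint'].ffill()
```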
Upvotes: 1