Reputation: 2146
Question: Using pandas -- how to efficiently fill-in missing dates with zero values, with monthly (e.g. last day indexed) frequency, relative to the min/max date values per group?
Edit do not assume input dates to correspond to the last day in its month. To deal with this issue, add the following line to the proposed answers below:
df.date = df.date + pd.offsets.MonthEnd(0)
Without this fix, filled in values with freq='M' can results in NA's...!
Note: pandas version 0.24.2
Example input:
data = [{'name': 'A', 'date': '2019-01-01', 'val': 10},
{'name': 'A', 'date': '2019-04-30', 'val': 2},
{'name': 'B', 'date': '2019-02-15', 'val': 6},
{'name': 'B', 'date': '2019-05-01', 'val': 5}]
df = pd.DataFrame(data)
date name val
0 2019-01-01 A 10
1 2019-04-30 A 2
2 2019-02-15 B 6
3 2019-05-01 B 5
Note that the dates in the input are not necessarily the first or last day of their corresponding month.
Example desired output
date name val
0 2019-01-31 A 10
1 2019-02-28 A 0
2 2019-03-31 A 0
3 2019-04-30 A 2
4 2019-02-28 B 6
5 2019-03-31 B 0
6 2019-04-30 B 0
7 2019-05-31 B 5
Attempts:
The following works at the index level but fills everything with NA:
df['date'] = pd.to_datetime(df['date'])
dg = df.groupby('name').apply(lambda x: x.reindex(pd.date_range(min(x.date), max(x.date), freq='M')))
Also:
Pandas filling missing dates and values within group
The above link's answer appear not be not relative to for each group, but rather the entire dataset's min/max date values.
Upvotes: 4
Views: 1823
Reputation: 25249
I would use groupby
, resample
and asfreq
(edit: as you updated the question on non-MonthEnd dates. I added pd.offsets.MonthEnd
as you proposed)
df.date = df.date + pd.offsets.MonthEnd(0)
(df.set_index('date').groupby('name').resample('M')
.asfreq(fill_value=0).drop('name',1)
.reset_index())
Out[550]:
name date val
0 A 2019-01-31 10
1 A 2019-02-28 0
2 A 2019-03-31 0
3 A 2019-04-30 2
4 B 2019-02-28 6
5 B 2019-03-31 0
6 B 2019-04-30 0
7 B 2019-05-31 5
Upvotes: 2
Reputation: 150765
One quick fix:
df.date = pd.to_datetime(df.date)
new_df = (df.set_index('date')
.groupby('name', as_index=False)
.apply(lambda x: x.resample('M').interpolate())
.reset_index(0, drop=True)
)
s = new_df['name'].isna()
new_df.loc[s, 'val'] = 0
new_df['name'] = new_df['name'].ffill()
Output:
name val
date
2019-01-31 A 10.0
2019-02-28 A 0.0
2019-03-31 A 0.0
2019-04-30 A 2.0
2019-02-28 B 6.0
2019-03-31 B 0.0
2019-04-30 B 0.0
2019-05-31 B 5.0
Upvotes: 1