Quetzalcoatl
Quetzalcoatl

Reputation: 2146

Pandas fill-in-zero for by missing dates *defined by* group

Question: Using pandas -- how to efficiently fill-in missing dates with zero values, with monthly (e.g. last day indexed) frequency, relative to the min/max date values per group?

Edit do not assume input dates to correspond to the last day in its month. To deal with this issue, add the following line to the proposed answers below:

df.date = df.date + pd.offsets.MonthEnd(0) 

Without this fix, filled in values with freq='M' can results in NA's...!

Note: pandas version 0.24.2

Example input:

data = [{'name': 'A', 'date': '2019-01-01', 'val': 10},
{'name': 'A', 'date': '2019-04-30', 'val': 2},
{'name': 'B', 'date': '2019-02-15', 'val': 6},
{'name': 'B', 'date': '2019-05-01', 'val': 5}]

df = pd.DataFrame(data)

         date name  val
0  2019-01-01    A   10
1  2019-04-30    A    2
2  2019-02-15    B    6
3  2019-05-01    B    5

Note that the dates in the input are not necessarily the first or last day of their corresponding month.

Example desired output

         date name  val
0  2019-01-31    A   10
1  2019-02-28    A    0
2  2019-03-31    A    0
3  2019-04-30    A    2
4  2019-02-28    B    6
5  2019-03-31    B    0
6  2019-04-30    B    0
7  2019-05-31    B    5

Attempts:

The following works at the index level but fills everything with NA:

df['date'] = pd.to_datetime(df['date'])

dg = df.groupby('name').apply(lambda x: x.reindex(pd.date_range(min(x.date), max(x.date), freq='M')))

Also:

Pandas filling missing dates and values within group

The above link's answer appear not be not relative to for each group, but rather the entire dataset's min/max date values.

Upvotes: 4

Views: 1823

Answers (2)

Andy L.
Andy L.

Reputation: 25249

I would use groupby, resample and asfreq (edit: as you updated the question on non-MonthEnd dates. I added pd.offsets.MonthEnd as you proposed)

df.date = df.date + pd.offsets.MonthEnd(0) 
(df.set_index('date').groupby('name').resample('M')
                     .asfreq(fill_value=0).drop('name',1)
                     .reset_index())

Out[550]:
  name       date  val
0    A 2019-01-31   10
1    A 2019-02-28    0
2    A 2019-03-31    0
3    A 2019-04-30    2
4    B 2019-02-28    6
5    B 2019-03-31    0
6    B 2019-04-30    0
7    B 2019-05-31    5

Upvotes: 2

Quang Hoang
Quang Hoang

Reputation: 150765

One quick fix:

df.date = pd.to_datetime(df.date)

new_df = (df.set_index('date')
            .groupby('name', as_index=False)
            .apply(lambda x: x.resample('M').interpolate())
            .reset_index(0, drop=True)
         )

s = new_df['name'].isna() 
new_df.loc[s, 'val'] = 0
new_df['name'] = new_df['name'].ffill()

Output:

           name   val
date                 
2019-01-31    A  10.0
2019-02-28    A   0.0
2019-03-31    A   0.0
2019-04-30    A   2.0
2019-02-28    B   6.0
2019-03-31    B   0.0
2019-04-30    B   0.0
2019-05-31    B   5.0

Upvotes: 1

Related Questions