Reputation: 531
I want to be able to calculate sales forecasts using the last 12 months worth of data for each of the products I sell.
I have time series data with a product name, a month, and a number of purchases in that month. However, some months where there were no sales, there is no data for that month.
My dataframe looks like this:
2014-06 product1 100
2014-07 product1 50
2014-10 product1 120
But I want it to look like this:
2014-06 product1 100
2014-07 product1 50
2014-08 product1
2014-09 product1
2014-10 product1 120
with a row for every month, not just the months with data. What's the most effective way to add the rows where the given month has no sales data?
Upvotes: 0
Views: 311
Reputation: 20695
You can use Dataframe.reindex
after constructing a new PeriodIndex
with the missing months, using pd.date_range
and to_period()
. First, I'll recreate your data, converting your months to instances of Period
:
index = pd.to_datetime(['2014-06', '2014-07', '2014-10']).to_period('M')
data = pd.DataFrame({
'name': 'product1',
'count': [100, 50, 120]
}, index=index)
Now we create a new index with all of the months in the range:
new_index = pd.date_range(
start=index[0].to_timestamp(how='end'),
end=index[-1].to_timestamp(how='end'),
freq='M').to_period()
This looks like:
>>> new_index
PeriodIndex(['2014-06', '2014-07', '2014-08', '2014-09', '2014-10'],
dtype='int64', freq='M')
Such that:
>>> res = data.reindex(new_index, method='backfill')
>>> res
count name
2014-06 100 product1
2014-07 50 product1
2014-08 120 product1
2014-09 120 product1
2014-10 120 product1
You'll notice that both the name
and count
have been backfilled, whereas you only wanted the name
to be backfilled. We can set the count
for the new rows to be NaN
as follows:
ix = new_index.difference(index)
res.loc[ix, 'count'] = None
So that:
>>> res
count name
2014-06 100 product1
2014-07 50 product1
2014-08 NaN product1
2014-09 NaN product1
2014-10 120 product1
Upvotes: 1