Reputation: 984
I have a scenario in which I have a dataframe which contains 4 columns:
date, product, store, sales_amt
1/1/2019, A,A,200
1/1/2019,A,B,120
1/2/2019, A,A,75
1/3/2019,A,A,69
1/3/2019,A,B,23
--
--
--
1/31/2019,A,B,49
The dates are supposed to span a whole month (e.g. in this case, January 2019) but there are some missing days in the dataframe.
Does anyone have any tips on Python code that can loop through the dates for a particular month and add a new row to the dataframe with the missing date
, product
/store
combination and a sales_amt
of zero?
For example, there is no entry for the product
/store
combination of A/B on 1/2/2019.
The goal at the end is to have an entry for every day of that month for every product
/store
combination.
How can I do this?
Upvotes: 2
Views: 935
Reputation: 28233
how can i still ensure that i have a a date entry for each store/product entry?
The trick I'd use is to pivot the column combination for which I missing values are desired, then resample
to generate the missing dates, fill the na
with 0 & finally reshape back to original shape & reset_index
Here's a short script demonstrating this:
import pandas as pd
import numpy as np
products = ['tablet', 'laptop', 'phone']
stores = ['downtown', 'subburb', 'supermall']
date_range = pd.date_range('2019-01-01', '2019-03-31')
# create a sample data frame
df = pd.DataFrame({
'date': date_range,
'product': np.random.choice(products, len(date_range)),
'store': np.random.choice(stores, len(date_range)),
'sales_amt': np.random.normal(50, 10, len(date_range))
})
# remove some dates
df = df[~df.date.isin(['2019-01-10', '2019-01-11', '2019-02-07'])]
# set date as index, pivot product & store, fill na with 0
# reindex & unpivot
shape2 = df.set_index(['date', 'product', 'store']).\
unstack([1, 2]).\
resample('D').asfreq().\
fillna(0).\
stack([1, 2]).\
reset_index()
print('%d unique dates in original df' % df['date'].nunique())
print('%d rows in original df' % len(df))
print('%d unique dates after filling missing values' % shape2['date'].nunique())
print('%d rows in after filling missing values' % len(shape2))
shape2.head()
It is also possible to do this through a left join
, where first a dataframe of all the desired combinations is created, and then the source data frame is left joined
. This approach takes marginally longer to express, but would be more intuitive for folks coming from an SQL mind-set.
cross_product = pd.MultiIndex.from_product([
pd.date_range(df.date.min(), df.date.max()),
df['product'].unique(),
df.store.unique()]
).to_frame().\
reset_index(drop=True).\
rename({0:'date', 1:'product', 2:'store'}, axis=1)
final_df = cross_product.merge(df,
left_on=['date', 'product', 'store'],
right_on=['date', 'product', 'store'],
how='left').fillna(0)
Upvotes: 1
Reputation: 141
I believe this can be done by creating an index of dates for entire month (in your case all dates between the first date and the last date) and then left merging it on the original dataframe.
So, if 'df' is the original dataframe containing date, product, store and sales_amt then adding following code should help.
all_dates = pd.DataFrame(pd.date_range(df['date'].min(), df['date'].max()))
all_dates.rename(columns = {0: 'date'}, inplace = True)
df = all_dates.merge(df, on = 'date', how = 'left')
Upvotes: 0
Reputation: 153460
Use resample
after set_index
:
#create a dummy dataframe with data every other day
s=pd.date_range('2019-01-01', '2019-05-01', freq='2D')
df = pd.DataFrame({'Date':s, 'sales_amt':np.random.randint(100,1000,61)})
df.set_index('Date').resample('D').asfreq().fillna(0)
Output:
sales_amt
Date
2019-01-01 996.0
2019-01-02 0.0
2019-01-03 236.0
2019-01-04 0.0
2019-01-05 225.0
... ...
2019-04-27 444.0
2019-04-28 0.0
2019-04-29 756.0
2019-04-30 0.0
2019-05-01 641.0
Upvotes: 3