Reputation: 6132
I know that a similiar question to this one has been made, but the solution works when you have only one categorical variable. I have two of those, and MultiIndexes
have always been difficult for me to work with. The thing is, I've got the following dataframe:
Date Product eCommerce Sales
12-10-2018 A 1 10
12-12-2018 A 0 7
12-13-2018 A 1 4
12-15-2018 A 1 2
12-15-2018 A 0 2
12-11-2018 B 1 8
12-13-2018 B 1 6
So I need to know sales with and without eCommerce for every date and fill the dates that are not appearing with zero sales. My desired output would be:
Date Product eCommerce Sales
12-10-2018 A 1 10
12-11-2018 A 1 0
12-12-2018 A 1 0
12-13-2018 A 1 4
12-14-2018 A 1 0
12-15-2018 A 1 2
12-12-2018 A 0 7
12-13-2018 A 0 0
12-14-2018 A 0 0
12-15-2018 A 0 2
12-11-2018 B 1 8
12-12-2018 B 1 0
12-13-2018 B 1 6
Note: It's just one DataFrame, I just added the spaces to differentiate the indexes.
So in the original DataFrame 12-11 and 12-14 are missing for eCommerce Sales of Product A, 12-13 and 12-14 for non-eCommerce Sales of Product A and 12-12 is missing for eCommerce Sales of product B
The thing is, I managed to achieve what I want, through a very inefficient loop that takes about 25 minutes to run:
df_full= pd.DataFrame(columns=df.columns)
for sku in df['Product'].unique():
aux=df.loc[df['Product']==sku]
dates= pd.DataFrame(pd.date_range(start=aux.Date.min(), end=aux.Date.max(),freq='D'),columns=['Date'])
df3 = df.loc[df['id_prod']==sku].merge(dates,
how='outer',left_on='Date',
right_on='Date').sort_values(by='Date')
df3.fillna(method='ffill',inplace=True)
df_full= df_full.append(df3)
I'm quite confident that I can do this in a vectorized way, that should take less time (I have 2,300 products for 290 possible dates). Do you know how can I achieve this?
Edit: Added bolded text with better explanation of the problem
Upvotes: 2
Views: 93
Reputation: 59519
Use a single DatetimeIndex
then groupby
+ resample.asfreq()
, (can use sum for numeric columns) as the date range is group dependent.
import pandas as pd
df['Date'] = pd.to_datetime(df.Date)
df = df.set_index('Date')
df.groupby(['Product', 'eCommerce'], sort=False).Sales.resample('D').sum().reset_index()
If you have many columns you want to fill with zero then:
(df.groupby(['Product', 'eCommerce'], sort=False)
.resample('D').sum()
.drop(columns=['Product', 'eCommerce'])
.reset_index())
Product eCommerce Date Sales
0 A 1 2018-12-10 10
1 A 1 2018-12-11 0
2 A 1 2018-12-12 0
3 A 1 2018-12-13 4
4 A 1 2018-12-14 0
5 A 1 2018-12-15 2
6 A 0 2018-12-12 7
7 A 0 2018-12-13 0
8 A 0 2018-12-14 0
9 A 0 2018-12-15 2
10 B 1 2018-12-11 8
11 B 1 2018-12-12 0
12 B 1 2018-12-13 6
Upvotes: 2