Juan C
Juan C

Reputation: 6132

Filling dates in dataframe with triple index

I know that a similiar question to this one has been made, but the solution works when you have only one categorical variable. I have two of those, and MultiIndexes have always been difficult for me to work with. The thing is, I've got the following dataframe:

 Date        Product    eCommerce   Sales
12-10-2018      A           1        10
12-12-2018      A           0        7
12-13-2018      A           1        4
12-15-2018      A           1        2
12-15-2018      A           0        2
12-11-2018      B           1        8
12-13-2018      B           1        6

So I need to know sales with and without eCommerce for every date and fill the dates that are not appearing with zero sales. My desired output would be:

 Date        Product    eCommerce   Sales
12-10-2018      A           1        10
12-11-2018      A           1        0
12-12-2018      A           1        0
12-13-2018      A           1        4
12-14-2018      A           1        0
12-15-2018      A           1        2

12-12-2018      A           0        7
12-13-2018      A           0        0
12-14-2018      A           0        0
12-15-2018      A           0        2

12-11-2018      B           1        8
12-12-2018      B           1        0
12-13-2018      B           1        6

Note: It's just one DataFrame, I just added the spaces to differentiate the indexes.

So in the original DataFrame 12-11 and 12-14 are missing for eCommerce Sales of Product A, 12-13 and 12-14 for non-eCommerce Sales of Product A and 12-12 is missing for eCommerce Sales of product B

The thing is, I managed to achieve what I want, through a very inefficient loop that takes about 25 minutes to run:

df_full= pd.DataFrame(columns=df.columns)
for sku in df['Product'].unique():

    aux=df.loc[df['Product']==sku]
    dates= pd.DataFrame(pd.date_range(start=aux.Date.min(), end=aux.Date.max(),freq='D'),columns=['Date'])

    df3 = df.loc[df['id_prod']==sku].merge(dates,
                                                   how='outer',left_on='Date',
                                                   right_on='Date').sort_values(by='Date')

    df3.fillna(method='ffill',inplace=True)
    df_full= df_full.append(df3)

I'm quite confident that I can do this in a vectorized way, that should take less time (I have 2,300 products for 290 possible dates). Do you know how can I achieve this?

Edit: Added bolded text with better explanation of the problem

Upvotes: 2

Views: 93

Answers (1)

ALollz
ALollz

Reputation: 59519

Use a single DatetimeIndex then groupby + resample.asfreq(), (can use sum for numeric columns) as the date range is group dependent.

import pandas as pd

df['Date'] = pd.to_datetime(df.Date)
df = df.set_index('Date')

df.groupby(['Product', 'eCommerce'], sort=False).Sales.resample('D').sum().reset_index()

If you have many columns you want to fill with zero then:

(df.groupby(['Product', 'eCommerce'], sort=False)
    .resample('D').sum()
    .drop(columns=['Product', 'eCommerce'])
    .reset_index())

Output:

   Product  eCommerce       Date  Sales
0        A          1 2018-12-10     10
1        A          1 2018-12-11      0
2        A          1 2018-12-12      0
3        A          1 2018-12-13      4
4        A          1 2018-12-14      0
5        A          1 2018-12-15      2
6        A          0 2018-12-12      7
7        A          0 2018-12-13      0
8        A          0 2018-12-14      0
9        A          0 2018-12-15      2
10       B          1 2018-12-11      8
11       B          1 2018-12-12      0
12       B          1 2018-12-13      6

Upvotes: 2

Related Questions