Rod0n
Rod0n

Reputation: 1079

Fill pandas multi_index dataframe empty hours with nearest (previous) time

I've a dataframe with the following structure:

Time Company Product_type Total_sales
2021-01-31 06:00:00+00:00 Adidas Shoes 20
2021-01-31 05:00:00+00:00 Adidas Shoes 13
2021-01-31 03:00:00+00:00 Adidas Shoes 4
2021-01-31 03:00:00+00:00 Nike Shoes 5
2021-01-31 02:00:00+00:00 Adidas Shoes 3
2021-01-31 02:00:00+00:00 Nike Shoes 3

What I need to do is to "fill" the time_hour gaps with the nearest previous value (time) according to their company and product_type. In this case, for Adidas, a row for the 04:00 is missing so it'll need to be filled with 4, the value from 03:00 sales.

Time Company Product_type Total_sales
2021-01-31 06:00:00+00:00 Adidas Shoes 20
2021-01-31 05:00:00+00:00 Adidas Shoes 13
2021-01-31 04:00:00+00:00 Adidas Shoes 4
2021-01-31 03:00:00+00:00 Adidas Shoes 4
2021-01-31 03:00:00+00:00 Nike Shoes 5
2021-01-31 02:00:00+00:00 Adidas Shoes 3
2021-01-31 02:00:00+00:00 Nike Shoes 3

I know how to do it in the case of using a datetime as the unique index but this multi_index setting is something I couldn't solve for the moment.

Upvotes: 0

Views: 86

Answers (1)

mullinscr
mullinscr

Reputation: 1748

First we need to make sure that the Time column is a datetime column.

df['Time'] = pd.to_datetime(df['Time'])

Then we need to upsample the times to generate missing hours for each Company and Product_type.

df = df.groupby(['Company', 'Product_type']).resample(on='Time', rule='1H').sum().reset_index()

This gives us

  Company Product_type                      Time  Total_sales
0  Adidas        Shoes 2021-01-31 02:00:00+00:00            3
1  Adidas        Shoes 2021-01-31 03:00:00+00:00            4
2  Adidas        Shoes 2021-01-31 04:00:00+00:00            0
3  Adidas        Shoes 2021-01-31 05:00:00+00:00           13
4  Adidas        Shoes 2021-01-31 06:00:00+00:00           20
5    Nike        Shoes 2021-01-31 02:00:00+00:00            3
6    Nike        Shoes 2021-01-31 03:00:00+00:00            5

Then we can fill the zeros with the closest previous hour's data.

result = df.sort_values(['Company', 'Product_type', 'Time']).replace(0,np.nan).ffill()

Result is

  Company Product_type                      Time  Total_sales
0  Adidas        Shoes 2021-01-31 02:00:00+00:00          3.0
1  Adidas        Shoes 2021-01-31 03:00:00+00:00          4.0
2  Adidas        Shoes 2021-01-31 04:00:00+00:00          4.0
3  Adidas        Shoes 2021-01-31 05:00:00+00:00         13.0
4  Adidas        Shoes 2021-01-31 06:00:00+00:00         20.0
5    Nike        Shoes 2021-01-31 02:00:00+00:00          3.0
6    Nike        Shoes 2021-01-31 03:00:00+00:00          5.0

Upvotes: 1

Related Questions