Reputation: 1079
I've a dataframe with the following structure:
Time | Company | Product_type | Total_sales |
---|---|---|---|
2021-01-31 06:00:00+00:00 | Adidas | Shoes | 20 |
2021-01-31 05:00:00+00:00 | Adidas | Shoes | 13 |
2021-01-31 03:00:00+00:00 | Adidas | Shoes | 4 |
2021-01-31 03:00:00+00:00 | Nike | Shoes | 5 |
2021-01-31 02:00:00+00:00 | Adidas | Shoes | 3 |
2021-01-31 02:00:00+00:00 | Nike | Shoes | 3 |
What I need to do is to "fill" the time_hour gaps with the nearest previous value (time) according to their company and product_type. In this case, for Adidas, a row for the 04:00 is missing so it'll need to be filled with 4, the value from 03:00 sales.
Time | Company | Product_type | Total_sales |
---|---|---|---|
2021-01-31 06:00:00+00:00 | Adidas | Shoes | 20 |
2021-01-31 05:00:00+00:00 | Adidas | Shoes | 13 |
2021-01-31 04:00:00+00:00 | Adidas | Shoes | 4 |
2021-01-31 03:00:00+00:00 | Adidas | Shoes | 4 |
2021-01-31 03:00:00+00:00 | Nike | Shoes | 5 |
2021-01-31 02:00:00+00:00 | Adidas | Shoes | 3 |
2021-01-31 02:00:00+00:00 | Nike | Shoes | 3 |
I know how to do it in the case of using a datetime as the unique index but this multi_index setting is something I couldn't solve for the moment.
Upvotes: 0
Views: 86
Reputation: 1748
First we need to make sure that the Time column is a datetime column.
df['Time'] = pd.to_datetime(df['Time'])
Then we need to upsample the times to generate missing hours for each Company
and Product_type
.
df = df.groupby(['Company', 'Product_type']).resample(on='Time', rule='1H').sum().reset_index()
This gives us
Company Product_type Time Total_sales
0 Adidas Shoes 2021-01-31 02:00:00+00:00 3
1 Adidas Shoes 2021-01-31 03:00:00+00:00 4
2 Adidas Shoes 2021-01-31 04:00:00+00:00 0
3 Adidas Shoes 2021-01-31 05:00:00+00:00 13
4 Adidas Shoes 2021-01-31 06:00:00+00:00 20
5 Nike Shoes 2021-01-31 02:00:00+00:00 3
6 Nike Shoes 2021-01-31 03:00:00+00:00 5
Then we can fill the zeros with the closest previous hour's data.
result = df.sort_values(['Company', 'Product_type', 'Time']).replace(0,np.nan).ffill()
Result
is
Company Product_type Time Total_sales
0 Adidas Shoes 2021-01-31 02:00:00+00:00 3.0
1 Adidas Shoes 2021-01-31 03:00:00+00:00 4.0
2 Adidas Shoes 2021-01-31 04:00:00+00:00 4.0
3 Adidas Shoes 2021-01-31 05:00:00+00:00 13.0
4 Adidas Shoes 2021-01-31 06:00:00+00:00 20.0
5 Nike Shoes 2021-01-31 02:00:00+00:00 3.0
6 Nike Shoes 2021-01-31 03:00:00+00:00 5.0
Upvotes: 1