Reputation: 1079
I've a dataframe with the following structure:
Time | Company | Product_type | Total_sales |
---|---|---|---|
2021-01-31 06:00:00+00:00 | Adidas | Shoes | 20 |
2021-01-31 05:00:00+00:00 | Adidas | Shoes | 13 |
2021-01-31 03:00:00+00:00 | Adidas | Shoes | 4 |
2021-01-31 03:00:00+00:00 | Nike | Shoes | 5 |
2021-01-31 02:00:00+00:00 | Adidas | Shoes | 3 |
2021-01-31 02:00:00+00:00 | Nike | Shoes | 3 |
What I needed was to "fill" the time_hour gaps with the nearest previous value (time) according to their company and product_type. In this case, for Adidas, a row for the 04:00 is missing so it'll need to be filled with 4, the value from 03:00 sales.
That was solved in the following link so I now have this:
Time | Company | Product_type | Total_sales |
---|---|---|---|
2021-01-31 06:00:00+00:00 | Adidas | Shoes | 20 |
2021-01-31 05:00:00+00:00 | Adidas | Shoes | 13 |
2021-01-31 04:00:00+00:00 | Adidas | Shoes | 4 |
2021-01-31 03:00:00+00:00 | Adidas | Shoes | 4 |
2021-01-31 03:00:00+00:00 | Nike | Shoes | 5 |
2021-01-31 02:00:00+00:00 | Adidas | Shoes | 3 |
2021-01-31 02:00:00+00:00 | Nike | Shoes | 3 |
Now what I want to do is to fill it with the dates until the current time. For example today is 2021-02-05 so I need all the hours (rows) between 2021-01-31 06:00:00 and 2021-02-05 to be added with "20" for Adidas Shoes as the last total_sales value says.
I've thought about simply adding a row with the last value and the current date but it seems it won't work and would add a lot of possible bugs.
Upvotes: 1
Views: 731
Reputation: 1738
This method will extend the dataframe to today's date and then fill in the relevant data up to this point.
It does this by first creating a new dataframe that consists of today's date and the companies and their products from the existing dataframe. Then it appends this new dataframe to the existing dataframe. Now when you run the code from the other answer you should get your desired result.
import datetime
import pandas as pd
import numpy as np
# create the dataframe with today's date
# and the existing companies and their products
today_df = pd.DataFrame({'Time' : [datetime.datetime.now()] * company_products.shape[0],
'Company': company_products['Company'],
'Product_type': company_products['Product_type']})
# append this to the original dataframe
df = df.append(today_df)
# ensure the 'Time' column is a datetime
df['Time'] = pd.to_datetime(df['Time'], utc=True)
# run the other answer's code
resmapled = df.groupby(['Company', 'Product_type']).resample(on='Time', rule='1H').sum().reset_index()
result = resampled.sort_values(['Company', 'Product_type', 'Time']).replace(0,np.nan).ffill()
Note: You could change the datetime.datetime.now()
part of the code to be a specific date - '2021-02-28'
for example.
Upvotes: 1
Reputation: 35115
The code I've created is limited to Adidas, but you can combine Nike in the same way. The main point is to update the original dataframe by creating a pd.datetime_range()
from the last data of the real data to today. At that time, we will fill in the holes backward.
df['Time'] = pd.to_datetime(df['Time'])
df.set_index('Time', inplace=True)
df_adidas = df.loc[df['Company'] == 'Adidas']
tidx = pd.date_range(df_adidas.head(1).index.values[0], datetime.date.today(), freq='1H', tz='UTC')
df_adidas.reindex(tidx, method='bfill')
Company Product_type Total_sales
2021-01-31 06:00:00+00:00 Adidas Shoes 20
2021-01-31 07:00:00+00:00 Adidas Shoes 20
2021-01-31 08:00:00+00:00 Adidas Shoes 20
2021-01-31 09:00:00+00:00 Adidas Shoes 20
2021-01-31 10:00:00+00:00 Adidas Shoes 20
... ... ... ...
2021-02-05 20:00:00+00:00 Adidas Shoes 20
2021-02-05 21:00:00+00:00 Adidas Shoes 20
2021-02-05 22:00:00+00:00 Adidas Shoes 20
2021-02-05 23:00:00+00:00 Adidas Shoes 20
2021-02-06 00:00:00+00:00 Adidas Shoes 20
Upvotes: 1