Reputation: 49
Hi I have a dataframe like below
Cust_ID created_date tran_date Sales_Value Quantity_Sold
0 1 2021-01-31 2021-01-31 126 12
1 1 2021-01-31 2021-03-31 5 3
2 1 2021-01-31 2021-06-30 20 5
3 2 2020-12-31 2021-02-28 200 5
4 2 2020-12-31 2021-03-31 250 25
I need to fill the missing months from the created date to till current month for each customer_ID. basically for each customer I should have one row for each month/year since the customer created date to the current date so the dataframe should look like below if current month is 07/2021
Cust_ID created_date tran_date Sales_Value Quantity_Sold
0 1 2021-01-31 2021-01-31 126 12
1 1 2021-01-31 2021-02-28 0 0
2 1 2021-01-31 2021-03-31 5 3
3 1 2021-01-31 2021-04-30 0 0
4 1 2021-01-31 2021-05-31 0 0
5 1 2021-01-31 2021-06-30 20 5
6 1 2021-01-31 2021-07-31 0 0
7 2 2020-12-31 2020-12-31 0 0
8 2 2020-12-31 2021-01-31 0 0
9 2 2020-12-31 2021-02-28 200 5
10 2 2020-12-31 2021-03-31 250 25
11 2 2020-12-31 2021-04-30 0 0
12 2 2020-12-31 2021-05-31 0 0
13 2 2020-12-31 2021-06-30 0 0
14 2 2020-12-31 2021-07-31 0 0
what's the best way to achieve this?
Upvotes: 2
Views: 950
Reputation: 49
@jezrael Thank you for your solution. When I tried it , it worked as good for the Cust_ID 1 . but for Cust_ID 2 the sales and quantity values are incorrect for some tran_dates
df = pd.DataFrame({
"Cust_ID":
[1,1,1,2,2],
"created_date":
["2021-01-31","2021-01-31","2021-01-31","2020-12-31","2021-12-31"],
"tran_date":
["2021-01-31","2021-03-31","2021-06-30","2021-02-28","2021-03-31"],
"Sales_Value":
[126,5,20,200,250],
"Quantity_Sold":
[12,3,5,5,25]
})
df
df['created_date'] = pd.to_datetime(df['created_date'])
df['tran_date'] = pd.to_datetime(df['tran_date']).dt.to_period('m')
now = pd.to_datetime('now')
print (now)
f = lambda x: x.reindex(pd.period_range(x.name[1], now, freq='M', name='tran_date'),
fill_value=0)
df = (df.set_index('tran_date')
.groupby(['Cust_ID', 'created_date'])[['Sales_Value', 'Quantity_Sold']]
.apply(f)
.reset_index()
)
df['tran_date'] = df['tran_date'].dt.to_timestamp(how='end').dt.normalize()
print (df)
Below is how the output looks. the sales and quatity values are incorrect for the months for jan, feb and june months.
Upvotes: 0
Reputation: 28644
Convert the *dates
to datetime dtype:
temp = df.astype({"created_date": np.datetime64, "tran_date":np.datetime64})
Get the min and max_positions of tran_date
and replace them with values from created_date
and 2021-07-31
respectively:
min_positions = temp.groupby('Cust_ID').tran_date.idxmin()
max_positions = temp.groupby('Cust_ID').tran_date.idxmax()
temp.loc[min_positions, 'tran_date'] = temp.loc[min_positions, 'created_date']
end_date = pd.to_datetime("2021/7/31")
temp.loc[max_positions, 'tran_date'] = end_date
from here on, we can pull it off in two ways - the first one I propose uses Pandas only:
Group by Cust_ID, resample on tran_date
, and then build the frequencies on a Monthly basis:
(temp
.set_index('tran_date')
.groupby('Cust_ID')
.resample('1M')
.asfreq()
.drop(columns='Cust_ID')
.assign(created_date = lambda df: df.created_date.ffill())
.fillna(0)
.reset_index()
.filter(df.columns)
)
Cust_ID created_date tran_date Sales_Value Quantity_Sold
0 1 2021-01-31 2021-01-31 126.0 12.0
1 1 2021-01-31 2021-02-28 0.0 0.0
2 1 2021-01-31 2021-03-31 5.0 3.0
3 1 2021-01-31 2021-04-30 0.0 0.0
4 1 2021-01-31 2021-05-31 0.0 0.0
5 1 2021-01-31 2021-06-30 0.0 0.0
6 1 2021-01-31 2021-07-31 20.0 5.0
7 2 2020-12-31 2020-12-31 200.0 5.0
8 2 2020-12-31 2021-01-31 0.0 0.0
9 2 2020-12-31 2021-02-28 0.0 0.0
10 2 2020-12-31 2021-03-31 0.0 0.0
11 2 2020-12-31 2021-04-30 0.0 0.0
12 2 2020-12-31 2021-05-31 0.0 0.0
13 2 2020-12-31 2021-06-30 0.0 0.0
14 2 2020-12-31 2021-07-31 250.0 25.0
The other option uses complete
, with fill_direction
from pyjanitor
:
# pip install pyjanitor
import janitor
import pandas as pd
# create dictionary pairing `tran_date` with min and max_dates
new_date_values = lambda dates: pd.date_range(start = dates.min(),
end = dates.max(),
freq='1M')
new_date_values = {"tran_date" : new_date_values}
# `completes` each group with the new dates
(temp.complete([new_tran_date_values],
by = 'Cust_ID')
.fill_direction({"created_date" : "down"}) # wrapper for `ffill/bfill`
.fillna(0)
.filter(df)
)
Cust_ID created_date tran_date Sales_Value Quantity_Sold
0 1 2021-01-31 2021-01-31 126.0 12.0
1 1 2021-01-31 2021-02-28 0.0 0.0
2 1 2021-01-31 2021-03-31 5.0 3.0
3 1 2021-01-31 2021-04-30 0.0 0.0
4 1 2021-01-31 2021-05-31 0.0 0.0
5 1 2021-01-31 2021-06-30 0.0 0.0
6 1 2021-01-31 2021-07-31 20.0 5.0
7 2 2020-12-31 2020-12-31 200.0 5.0
8 2 2020-12-31 2021-01-31 0.0 0.0
9 2 2020-12-31 2021-02-28 0.0 0.0
10 2 2020-12-31 2021-03-31 0.0 0.0
11 2 2020-12-31 2021-04-30 0.0 0.0
12 2 2020-12-31 2021-05-31 0.0 0.0
13 2 2020-12-31 2021-06-30 0.0 0.0
14 2 2020-12-31 2021-07-31 250.0 25.0
Upvotes: 1
Reputation: 862481
Use GroupBy.apply
with custom function with DataFrame.reindex
and period_range
with convert tran_date
to months periods:
df['created_date'] = pd.to_datetime(df['created_date'])
df['tran_date'] = pd.to_datetime(df['tran_date']).dt.to_period('m')
now = pd.to_datetime('now')
print (now)
f = lambda x: x.reindex(pd.period_range(x.name[1], now, freq='M', name='tran_date'),
fill_value=0)
df = (df.set_index('tran_date')
.groupby(['Cust_ID', 'created_date'])[['Sales_Value', 'Quantity_Sold']]
.apply(f)
.reset_index()
)
df['tran_date'] = df['tran_date'].dt.to_timestamp(how='end').dt.normalize()
print (df)
Cust_ID created_date tran_date Sales_Value Quantity_Sold
0 1 2021-01-31 2021-01-31 126 12
1 1 2021-01-31 2021-02-28 0 0
2 1 2021-01-31 2021-03-31 5 3
3 1 2021-01-31 2021-04-30 0 0
4 1 2021-01-31 2021-05-31 0 0
5 1 2021-01-31 2021-06-30 20 5
6 1 2021-01-31 2021-07-31 0 0
7 2 2020-12-31 2020-12-31 0 0
8 2 2020-12-31 2021-01-31 0 0
9 2 2020-12-31 2021-02-28 200 5
10 2 2020-12-31 2021-03-31 250 25
11 2 2020-12-31 2021-04-30 0 0
12 2 2020-12-31 2021-05-31 0 0
13 2 2020-12-31 2021-06-30 0 0
14 2 2020-12-31 2021-07-31 0 0
Upvotes: 2