Reputation: 49

Pandas DataFrame -add rows for missing months

Hi I have a dataframe like below

  Cust_ID   created_date   tran_date        Sales_Value  Quantity_Sold
0        1   2021-01-31    2021-01-31         126             12
1        1   2021-01-31    2021-03-31           5              3
2        1   2021-01-31    2021-06-30          20              5
3        2   2020-12-31    2021-02-28         200              5
4        2   2020-12-31    2021-03-31         250             25

I need to fill the missing months from the created date to till current month for each customer_ID. basically for each customer I should have one row for each month/year since the customer created date to the current date so the dataframe should look like below if current month is 07/2021

    Cust_ID   created_date   tran_date        Sales_Value  Quantity_Sold
0        1   2021-01-31    2021-01-31         126             12
1        1   2021-01-31    2021-02-28           0              0
2        1   2021-01-31    2021-03-31           5              3
3        1   2021-01-31    2021-04-30           0              0
4        1   2021-01-31    2021-05-31           0              0
5        1   2021-01-31    2021-06-30          20              5
6        1   2021-01-31    2021-07-31           0              0
7        2   2020-12-31    2020-12-31           0              0
8        2   2020-12-31    2021-01-31           0              0
9        2   2020-12-31    2021-02-28         200              5
10       2   2020-12-31    2021-03-31         250             25
11       2   2020-12-31    2021-04-30           0              0
12       2   2020-12-31    2021-05-31           0              0
13       2   2020-12-31    2021-06-30           0              0
14       2   2020-12-31    2021-07-31           0              0

what's the best way to achieve this?

Upvotes: 2

Answers (3)

suma

Reputation: 49

@jezrael Thank you for your solution. When I tried it , it worked as good for the Cust_ID 1 . but for Cust_ID 2 the sales and quantity values are incorrect for some tran_dates

df = pd.DataFrame({
        "Cust_ID":
            [1,1,1,2,2],
    "created_date":
            ["2021-01-31","2021-01-31","2021-01-31","2020-12-31","2021-12-31"],
        "tran_date":
            ["2021-01-31","2021-03-31","2021-06-30","2021-02-28","2021-03-31"],
        "Sales_Value":
            [126,5,20,200,250],           
        "Quantity_Sold":
             [12,3,5,5,25] 
        
})

df

df['created_date'] = pd.to_datetime(df['created_date'])
df['tran_date'] = pd.to_datetime(df['tran_date']).dt.to_period('m')
now = pd.to_datetime('now')
print (now)

f = lambda x: x.reindex(pd.period_range(x.name[1], now, freq='M', name='tran_date'), 
                        fill_value=0)
df = (df.set_index('tran_date')
        .groupby(['Cust_ID', 'created_date'])[['Sales_Value', 'Quantity_Sold']]
        .apply(f)
        .reset_index()
        )

df['tran_date'] = df['tran_date'].dt.to_timestamp(how='end').dt.normalize()
print (df)

Below is how the output looks. the sales and quatity values are incorrect for the months for jan, feb and june months.

Upvotes: 0

sammywemmy

Reputation: 28729

Convert the *dates to datetime dtype:

temp = df.astype({"created_date": np.datetime64, "tran_date":np.datetime64})

Get the min and max_positions of tran_date and replace them with values from created_date and 2021-07-31 respectively:

min_positions = temp.groupby('Cust_ID').tran_date.idxmin()

max_positions = temp.groupby('Cust_ID').tran_date.idxmax()

temp.loc[min_positions, 'tran_date'] = temp.loc[min_positions, 'created_date']

end_date = pd.to_datetime("2021/7/31")

temp.loc[max_positions, 'tran_date'] = end_date

from here on, we can pull it off in two ways - the first one I propose uses Pandas only:

Group by Cust_ID, resample on tran_date, and then build the frequencies on a Monthly basis:

(temp
.set_index('tran_date')
.groupby('Cust_ID')
.resample('1M')
.asfreq()
.drop(columns='Cust_ID')
.assign(created_date = lambda df: df.created_date.ffill())
.fillna(0)
.reset_index()
.filter(df.columns)
)
 
    Cust_ID created_date  tran_date  Sales_Value  Quantity_Sold
0         1   2021-01-31 2021-01-31        126.0           12.0
1         1   2021-01-31 2021-02-28          0.0            0.0
2         1   2021-01-31 2021-03-31          5.0            3.0
3         1   2021-01-31 2021-04-30          0.0            0.0
4         1   2021-01-31 2021-05-31          0.0            0.0
5         1   2021-01-31 2021-06-30          0.0            0.0
6         1   2021-01-31 2021-07-31         20.0            5.0
7         2   2020-12-31 2020-12-31        200.0            5.0
8         2   2020-12-31 2021-01-31          0.0            0.0
9         2   2020-12-31 2021-02-28          0.0            0.0
10        2   2020-12-31 2021-03-31          0.0            0.0
11        2   2020-12-31 2021-04-30          0.0            0.0
12        2   2020-12-31 2021-05-31          0.0            0.0
13        2   2020-12-31 2021-06-30          0.0            0.0
14        2   2020-12-31 2021-07-31        250.0           25.0

The other option uses complete, with fill_direction from pyjanitor:

 # pip install pyjanitor
 import janitor
 import pandas as pd

 # create dictionary pairing `tran_date` with min and max_dates
  new_date_values = lambda dates: pd.date_range(start = dates.min(), 
                                                end = dates.max(), 
                                                freq='1M')
  new_date_values = {"tran_date" : new_date_values}
  
  # `completes` each group with the new dates
  (temp.complete([new_tran_date_values], 
                 by = 'Cust_ID')
   .fill_direction({"created_date" : "down"}) # wrapper for `ffill/bfill`
   .fillna(0)
   .filter(df)
   )

    Cust_ID created_date  tran_date  Sales_Value  Quantity_Sold
0         1   2021-01-31 2021-01-31        126.0           12.0
1         1   2021-01-31 2021-02-28          0.0            0.0
2         1   2021-01-31 2021-03-31          5.0            3.0
3         1   2021-01-31 2021-04-30          0.0            0.0
4         1   2021-01-31 2021-05-31          0.0            0.0
5         1   2021-01-31 2021-06-30          0.0            0.0
6         1   2021-01-31 2021-07-31         20.0            5.0
7         2   2020-12-31 2020-12-31        200.0            5.0
8         2   2020-12-31 2021-01-31          0.0            0.0
9         2   2020-12-31 2021-02-28          0.0            0.0
10        2   2020-12-31 2021-03-31          0.0            0.0
11        2   2020-12-31 2021-04-30          0.0            0.0
12        2   2020-12-31 2021-05-31          0.0            0.0
13        2   2020-12-31 2021-06-30          0.0            0.0
14        2   2020-12-31 2021-07-31        250.0           25.0

Upvotes: 1

jezrael

Reputation: 863651

Use GroupBy.apply with custom function with DataFrame.reindex and period_range with convert tran_date to months periods:

df['created_date'] = pd.to_datetime(df['created_date'])
df['tran_date'] = pd.to_datetime(df['tran_date']).dt.to_period('m')
now = pd.to_datetime('now')
print (now)

f = lambda x: x.reindex(pd.period_range(x.name[1], now, freq='M', name='tran_date'), 
                        fill_value=0)
df = (df.set_index('tran_date')
        .groupby(['Cust_ID', 'created_date'])[['Sales_Value', 'Quantity_Sold']]
        .apply(f)
        .reset_index()
        )

df['tran_date'] = df['tran_date'].dt.to_timestamp(how='end').dt.normalize()
print (df)
    Cust_ID created_date  tran_date  Sales_Value  Quantity_Sold
0         1   2021-01-31 2021-01-31          126             12
1         1   2021-01-31 2021-02-28            0              0
2         1   2021-01-31 2021-03-31            5              3
3         1   2021-01-31 2021-04-30            0              0
4         1   2021-01-31 2021-05-31            0              0
5         1   2021-01-31 2021-06-30           20              5
6         1   2021-01-31 2021-07-31            0              0
7         2   2020-12-31 2020-12-31            0              0
8         2   2020-12-31 2021-01-31            0              0
9         2   2020-12-31 2021-02-28          200              5
10        2   2020-12-31 2021-03-31          250             25
11        2   2020-12-31 2021-04-30            0              0
12        2   2020-12-31 2021-05-31            0              0
13        2   2020-12-31 2021-06-30            0              0
14        2   2020-12-31 2021-07-31            0              0

Upvotes: 2

Pandas DataFrame -add rows for missing months

Answers (3)

Related Questions