sangeetha sivakumar
sangeetha sivakumar

Reputation: 123

Replace the missing months & year in date column using python

I have a dataframe with date column, some of the data are missing based on year and month. i have to display are the months for all the years in my dataset and the corresponding columns should display with zeros.

My dataframe looks like this

 Date    Churn    Churnrate  customerID
2008,01  726.0  0.542398        2763
2008,02  345.0  0.257751        1351
2012,11    NaN       NaN           6
2013,01    3.0  0.002241          24
2013,02   10.0  0.007471          34
2013,03   25.0  0.018678          73
2013,04   25.0  0.018678          75
2013,05   14.0  0.010459          61
2013,06   19.0  0.014195          69
2013,07   27.0  0.020172         103
2013,08   22.0  0.016436          79
2013,09   19.0  0.014195          70
2013,10   28.0  0.020919          83
2013,11   22.0  0.016436          78
2013,12   19.0  0.014195          75
2014,01   17.0  0.012701          63
2014,02   21.0  0.015689          55
2014,03    7.0  0.005230          66
2014,04   24.0  0.017931          86
2014,05   18.0  0.013448          90
2014,06   14.0  0.010459          50

For example in the year 2018, i have only two month records, but i want to display all the 12 months with 0s in the corresponding columns

My another dataframe looks like this

Months  Retention_Rate  Customer_Count
0  2008/01        0.145916             133
1  2008/02        0.924663             762
2  2008/03        0.074544              67
3  2014/07        0.058684              45
4  2014/08        0.069786              61
5  2014/09        0.076130              64
6  2014/10        0.061856              60
7  2014/11        0.082474              69

I have used the same answer which is given below

predicted_retention_rate = predicted_retention_rate.set_index('Months')
idx =(pd.MultiIndex.from_product(predicted_retention_rate.index.str.split('/', expand=True).levels)
        .map('/'.join))

final_retention_rate_predicted = predicted_retention_rate.reindex(idx, fill_value=0).rename_axis('Months').reset_index()
print (final_retention_rate_predicted)

But some of the months are missing in this output

Months  Retention_Rate  Customer_Count
0   2008/01        0.145916             133
1   2008/02        0.924663             762
2   2008/03        0.074544              67
3   2008/07        0.000000               0
4   2008/08        0.000000               0
5   2008/09        0.000000               0
6   2008/10        0.000000               0
7   2008/11        0.000000               0
8   2014/01        0.000000               0
9   2014/02        0.000000               0
10  2014/03        0.000000               0
11  2014/07        0.058684              45
12  2014/08        0.069786              61
13  2014/09        0.076130              64
14  2014/10        0.061856              60
15  2014/11        0.082474              69

Look at the above dataframe, year 2008 contains 01,02,03 but not 04,05,06 and the same in 2014. May i known where i went wrong here.

Upvotes: 1

Views: 1724

Answers (2)

jezrael
jezrael

Reputation: 863731

I think need reindex by new index created by split Date to MultiIndex and map with join:

df = df.set_index('Date')
idx =(pd.MultiIndex.from_product(df.index.str.split(',', expand=True).levels)
        .map(','.join))

df = df.reindex(idx, fill_value=0).rename_axis('Date').reset_index()
print (df.head())
      Date  Churn  Churnrate  customerID
0  2008,01  726.0   0.542398        2763
1  2008,02  345.0   0.257751        1351
2  2008,03    0.0   0.000000           0
3  2008,04    0.0   0.000000           0
4  2008,05    0.0   0.000000           0

EDIT: Solution with define all Months by range(1,13)

df = df.set_index('Months')
years = df.index.str.split('/', expand=True).levels[0]
idx = (pd.MultiIndex.from_product([years, 
                                  pd.Series(range(1, 13)).astype(str).str.zfill(2)])
        .map('/'.join))

df = df.reindex(idx, fill_value=0).rename_axis('Date').reset_index()
print (df)
       Date  Retention_Rate  Customer_Count
0   2008/01        0.145916             133
1   2008/02        0.924663             762
2   2008/03        0.074544              67
3   2008/04        0.000000               0
4   2008/05        0.000000               0
5   2008/06        0.000000               0
6   2008/07        0.000000               0
7   2008/08        0.000000               0
8   2008/09        0.000000               0
9   2008/10        0.000000               0
10  2008/11        0.000000               0
11  2008/12        0.000000               0
12  2014/01        0.000000               0
13  2014/02        0.000000               0
14  2014/03        0.000000               0
15  2014/04        0.000000               0
16  2014/05        0.000000               0
17  2014/06        0.000000               0
18  2014/07        0.058684              45
19  2014/08        0.069786              61
20  2014/09        0.076130              64
21  2014/10        0.061856              60
22  2014/11        0.082474              69
23  2014/12        0.000000               0

If need replace the missing year and the corresponding columns with zeros:

print (df)
Year   Churn_Count  Churn_Rate  Customer_Count                                        
2008       1071.0    0.800149             4114
2012          0.0    0.000000                6
2013        233.0    0.174075              824
2014        101.0    0.075458              410

Then use:

df1 = (df.set_index('Year')
        .reindex(range(2008, 2015), fill_value=0)
        .reset_index())
print (df1)
   Year  Churn_Count  Churn_Rate  Customer_Count
0  2008       1071.0    0.800149            4114
1  2009          0.0    0.000000               0
2  2010          0.0    0.000000               0
3  2011          0.0    0.000000               0
4  2012          0.0    0.000000               6
5  2013        233.0    0.174075             824
6  2014        101.0    0.075458             410

More dynamic solution for reindex by min and max year:

df1 = df.set_index('Year')
df1 = (df1.reindex(range(df1.index.min(), df1.index.max() + 1), fill_value=0)
          .reset_index())
print (df1)
   Year  Churn_Count  Churn_Rate  Customer_Count
0  2008       1071.0    0.800149            4114
1  2009          0.0    0.000000               0
2  2010          0.0    0.000000               0
3  2011          0.0    0.000000               0
4  2012          0.0    0.000000               6
5  2013        233.0    0.174075             824
6  2014        101.0    0.075458             410

Upvotes: 2

rpanai
rpanai

Reputation: 13447

I think that another simple approach could be this.

import pandas as pd
df = pd.DataFrame({"date":["2010-01", "2010-02", "2011-01"], 
                   "a": [1, 2, 3], 
                   "b":[0.2,-0.1,0.4]})
df["date"] = pd.to_datetime(df["date"])
all_dates = pd.DataFrame({"date":pd.date_range(start=df["date"].min(), 
                                               end=df["date"].max(), 
                                               freq="MS")})
df = pd.merge(all_dates, df, how="left", on="date").fillna(0)

If date is your index you can just play with .reset_index() and .set_index(). Then if you want to maintain the same date format just add df["date"] = df["date"].dt.strftime("%Y-%m")

Upvotes: 1

Related Questions