Reputation: 123
I have a dataframe with date column, some of the data are missing based on year and month. i have to display are the months for all the years in my dataset and the corresponding columns should display with zeros.
My dataframe looks like this
Date Churn Churnrate customerID
2008,01 726.0 0.542398 2763
2008,02 345.0 0.257751 1351
2012,11 NaN NaN 6
2013,01 3.0 0.002241 24
2013,02 10.0 0.007471 34
2013,03 25.0 0.018678 73
2013,04 25.0 0.018678 75
2013,05 14.0 0.010459 61
2013,06 19.0 0.014195 69
2013,07 27.0 0.020172 103
2013,08 22.0 0.016436 79
2013,09 19.0 0.014195 70
2013,10 28.0 0.020919 83
2013,11 22.0 0.016436 78
2013,12 19.0 0.014195 75
2014,01 17.0 0.012701 63
2014,02 21.0 0.015689 55
2014,03 7.0 0.005230 66
2014,04 24.0 0.017931 86
2014,05 18.0 0.013448 90
2014,06 14.0 0.010459 50
For example in the year 2018, i have only two month records, but i want to display all the 12 months with 0s in the corresponding columns
My another dataframe looks like this
Months Retention_Rate Customer_Count
0 2008/01 0.145916 133
1 2008/02 0.924663 762
2 2008/03 0.074544 67
3 2014/07 0.058684 45
4 2014/08 0.069786 61
5 2014/09 0.076130 64
6 2014/10 0.061856 60
7 2014/11 0.082474 69
I have used the same answer which is given below
predicted_retention_rate = predicted_retention_rate.set_index('Months')
idx =(pd.MultiIndex.from_product(predicted_retention_rate.index.str.split('/', expand=True).levels)
.map('/'.join))
final_retention_rate_predicted = predicted_retention_rate.reindex(idx, fill_value=0).rename_axis('Months').reset_index()
print (final_retention_rate_predicted)
But some of the months are missing in this output
Months Retention_Rate Customer_Count
0 2008/01 0.145916 133
1 2008/02 0.924663 762
2 2008/03 0.074544 67
3 2008/07 0.000000 0
4 2008/08 0.000000 0
5 2008/09 0.000000 0
6 2008/10 0.000000 0
7 2008/11 0.000000 0
8 2014/01 0.000000 0
9 2014/02 0.000000 0
10 2014/03 0.000000 0
11 2014/07 0.058684 45
12 2014/08 0.069786 61
13 2014/09 0.076130 64
14 2014/10 0.061856 60
15 2014/11 0.082474 69
Look at the above dataframe, year 2008 contains 01,02,03 but not 04,05,06 and the same in 2014. May i known where i went wrong here.
Upvotes: 1
Views: 1724
Reputation: 863731
I think need reindex
by new index
created by split
Date
to MultiIndex
and map
with join
:
df = df.set_index('Date')
idx =(pd.MultiIndex.from_product(df.index.str.split(',', expand=True).levels)
.map(','.join))
df = df.reindex(idx, fill_value=0).rename_axis('Date').reset_index()
print (df.head())
Date Churn Churnrate customerID
0 2008,01 726.0 0.542398 2763
1 2008,02 345.0 0.257751 1351
2 2008,03 0.0 0.000000 0
3 2008,04 0.0 0.000000 0
4 2008,05 0.0 0.000000 0
EDIT: Solution with define all Month
s by range(1,13)
df = df.set_index('Months')
years = df.index.str.split('/', expand=True).levels[0]
idx = (pd.MultiIndex.from_product([years,
pd.Series(range(1, 13)).astype(str).str.zfill(2)])
.map('/'.join))
df = df.reindex(idx, fill_value=0).rename_axis('Date').reset_index()
print (df)
Date Retention_Rate Customer_Count
0 2008/01 0.145916 133
1 2008/02 0.924663 762
2 2008/03 0.074544 67
3 2008/04 0.000000 0
4 2008/05 0.000000 0
5 2008/06 0.000000 0
6 2008/07 0.000000 0
7 2008/08 0.000000 0
8 2008/09 0.000000 0
9 2008/10 0.000000 0
10 2008/11 0.000000 0
11 2008/12 0.000000 0
12 2014/01 0.000000 0
13 2014/02 0.000000 0
14 2014/03 0.000000 0
15 2014/04 0.000000 0
16 2014/05 0.000000 0
17 2014/06 0.000000 0
18 2014/07 0.058684 45
19 2014/08 0.069786 61
20 2014/09 0.076130 64
21 2014/10 0.061856 60
22 2014/11 0.082474 69
23 2014/12 0.000000 0
If need replace the missing year and the corresponding columns with zeros:
print (df)
Year Churn_Count Churn_Rate Customer_Count
2008 1071.0 0.800149 4114
2012 0.0 0.000000 6
2013 233.0 0.174075 824
2014 101.0 0.075458 410
Then use:
df1 = (df.set_index('Year')
.reindex(range(2008, 2015), fill_value=0)
.reset_index())
print (df1)
Year Churn_Count Churn_Rate Customer_Count
0 2008 1071.0 0.800149 4114
1 2009 0.0 0.000000 0
2 2010 0.0 0.000000 0
3 2011 0.0 0.000000 0
4 2012 0.0 0.000000 6
5 2013 233.0 0.174075 824
6 2014 101.0 0.075458 410
More dynamic solution for reindex
by min and max year:
df1 = df.set_index('Year')
df1 = (df1.reindex(range(df1.index.min(), df1.index.max() + 1), fill_value=0)
.reset_index())
print (df1)
Year Churn_Count Churn_Rate Customer_Count
0 2008 1071.0 0.800149 4114
1 2009 0.0 0.000000 0
2 2010 0.0 0.000000 0
3 2011 0.0 0.000000 0
4 2012 0.0 0.000000 6
5 2013 233.0 0.174075 824
6 2014 101.0 0.075458 410
Upvotes: 2
Reputation: 13447
I think that another simple approach could be this.
import pandas as pd
df = pd.DataFrame({"date":["2010-01", "2010-02", "2011-01"],
"a": [1, 2, 3],
"b":[0.2,-0.1,0.4]})
df["date"] = pd.to_datetime(df["date"])
all_dates = pd.DataFrame({"date":pd.date_range(start=df["date"].min(),
end=df["date"].max(),
freq="MS")})
df = pd.merge(all_dates, df, how="left", on="date").fillna(0)
If date
is your index you can just play with .reset_index()
and .set_index()
. Then if you want to maintain the same date format just add df["date"] = df["date"].dt.strftime("%Y-%m")
Upvotes: 1