Faliha Zikra
Faliha Zikra

Reputation: 421

Generating continuous series of missing year month for data in pandas

Here is the data. For each person, it gives the monetary contribution they made for that month:

    df.head()

    | year_mo | name  | contribution|
    | 202001  | Fey   |  30         |
    | 202003  | Fey   |  20         |
    | 202004  | Fey   |  10         |
    | 202001  | Jay   |  30         |
    | 202003  | Jay   |  20         |

I want to generate a continuous series of year_mo for each person from 202001-202006 where if the person didn't make a contribution, it would be 0 like this :

    | year_mo | name  | contribution|
    | 202001  | Fey   |  30         |
    | 202002  | Fey   |   0         |
    | 202003  | Fey   |  20         |
    | 202004  | Fey   |  10         |
    | 202005  | Fey   |   0         |
    | 202006  | Fey   |   0         |
    | 202001  | Jay   |  30         |
    | 202002  | Jay   |   0         |
    | 202003  | Jay   |  20         |
    | 202004  | Jay   |   0         |
    | 202005  | Jay   |   0         |
    | 202006  | Jay   |   0         |

In order to do this where I tried to do an outer join with the continuous year month df, I tried the following but didn't get the desired output:

   years = list(range(2020,2021,1))
   months= list(range(1,7,1))
   rows_list = []


   for yr in years:
     for mn in months:
     dict1 = {}
     dict1 = {'key' :1, 'year_mo' : (yr*100)+mn}
     rows_list.append(dict1)
   yearmo_df = pd.DataFrame(rows_list)
 
   df['key'] = 1

 
   new_df = pd.merge(left=df, right=yearmo_df, left_on='year_mo' ,right_on='year_mo', how='outer')

Upvotes: 1

Views: 31

Answers (1)

Quang Hoang
Quang Hoang

Reputation: 150735

We can use MultiIndex and reindex:

new_idx = pd.MultiIndex.from_product([np.arange(202001,202007), df['name'].unique()],
                                     names=['year_mo','name'])

(df.set_index(['year_mo', 'name'])
   .reindex(new_idx, fill_value=0)
   .sort_index(level=['name','year_mo'])
   .reset_index()
)

Another optino is pivot_table followed by re_index, and stack:

(df.pivot_table(index='year_mo', columns='name', fill_value=0)
   .reindex(np.arange(202001,202007), fill_value=0)
   .stack()
   .reset_index()
)

Output:

    year_mo name  contribution
0    202001  Fey            30
1    202002  Fey             0
2    202003  Fey            20
3    202004  Fey            10
4    202005  Fey             0
5    202006  Fey             0
6    202001  Jay            30
7    202002  Jay             0
8    202003  Jay            20
9    202004  Jay             0
10   202005  Jay             0
11   202006  Jay             0

Upvotes: 1

Related Questions