Python: fill missing dates for each group

Question

I have a DataFrame which looks like this:

import pandas as pd

x = pd.DataFrame({
    'user': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
    'rd': ['2016-01-01', '2016-01-01', '2016-02-01', '2016-02-01', '2016-02-01',
           '2016-05-01', '2016-05-01', '2016-06-01', '2016-06-01', '2016-06-01'],
    'fd': ['2016-02-01', '2016-04-01', '2016-03-01', '2016-04-01', '2016-05-01',
           '2016-06-01', '2016-07-01', '2016-08-01', '2016-07-01', '2016-09-01'],
    'val': [3, 4, 16, 7, 9, 2, 5, 11, 20, 1]})

x.head(6)

       fd          rd     user val
0   2016-02-01  2016-01-01  a   3
1   2016-04-01  2016-01-01  a   4
2   2016-03-01  2016-02-01  a   16
3   2016-04-01  2016-02-01  a   7
4   2016-05-01  2016-02-01  a   9
5   2016-06-01  2016-05-01  b   2

x['rd'] = pd.to_datetime(x['rd'])
x['fd'] = pd.to_datetime(x['fd'])

For each rd date I would like to have the dates of the following 3 months. For instance, given:

rd = 2016-01-01

I would like to have:

fd = [2016-02-01, 2016-03-01, 2016-04-01]

In short: for each rd date I want the next 3 months as fd dates. My dataset has gaps in both columns: some rd dates are missing entirely (2016-03-01, 2016-04-01), and for rd dates that are present some fd dates are missing (for rd = 2016-01-01, fd = 2016-03-01 is missing).

Furthermore, I have 2 different users (x['user'].unique() gives ['a', 'b']), so dates (both 'rd' and 'fd') may be missing for one user, for the other, or for both.

What I would like to achieve is an efficient way to get a dataframe with all dates for all users.

This builds on an already answered question, but the problem here is a little more complex, and I have not been able to fit a MultiIndex to it.

So far I have built the two columns of dates:

from dateutil.relativedelta import relativedelta

# Every month start between the first and last rd date.
index = pd.date_range(x['rd'].min(), x['rd'].max(), freq='MS')

def add_months(date):
    # The three month starts that follow `date`.
    return [date + relativedelta(months=k) for k in (1, 2, 3)]

# Flatten the per-rd lists into a single list of fd dates.
fcs_dates = [fd for rd in index for fd in add_months(rd)]
# Repeat each rd date three times so it lines up with fcs_dates.
index3 = sorted(index.tolist() * 3)

So the output is:

list(zip(index3, fcs_dates))[:5]

[(Timestamp('2016-01-01 00:00:00', freq='MS'),
  Timestamp('2016-02-01 00:00:00', freq='MS')),
 (Timestamp('2016-01-01 00:00:00', freq='MS'),
  Timestamp('2016-03-01 00:00:00', freq='MS')),
 (Timestamp('2016-01-01 00:00:00', freq='MS'),
  Timestamp('2016-04-01 00:00:00', freq='MS')),
 (Timestamp('2016-02-01 00:00:00', freq='MS'),
  Timestamp('2016-03-01 00:00:00', freq='MS')),
 (Timestamp('2016-02-01 00:00:00', freq='MS'),
  Timestamp('2016-04-01 00:00:00', freq='MS'))]

Unfortunately, I have no idea how to plug this into a MultiIndex.

Thank you for your help

Tommaso Guerrini · Accepted Answer

So, I solved my own question by doing a left join for each group (user), where the left dataframe is the one constructed with dates.

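DataFrame of dates (the left side of the join), built from index3 and fcs_dates as defined in the question:

left_df = pd.DataFrame({'rd': index3, 'fd': fcs_dates})
# The merge keys must have the same dtype on both sides. Cast to str only if
# x['rd'] and x['fd'] are still strings; skip these two lines if they were
# already converted with pd.to_datetime.
left_df['rd'] = left_df['rd'].astype(str)
left_df['fd'] = left_df['fd'].astype(str)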

Left join against each user's group:

df_gr = x.groupby(['user'])
list_gr = []
for i, gr in df_gr:
    # Keep every (rd, fd) pair from left_df; unmatched pairs get NaN.
    gr_new = pd.merge(left_df, gr, on=['rd', 'fd'], how='left')
    list_gr.append(gr_new)

df_final = pd.concat(list_gr)

Final dataframe:

    fd          rd          user val
0   2016-02-01  2016-01-01  a   3.0
1   2016-03-01  2016-01-01  NaN NaN
2   2016-04-01  2016-01-01  a   4.0
3   2016-03-01  2016-02-01  a   16.0
4   2016-04-01  2016-02-01  a   7.0
5   2016-05-01  2016-02-01  a   9.0
6   2016-04-01  2016-03-01  NaN NaN
7   2016-05-01  2016-03-01  NaN NaN
8   2016-06-01  2016-03-01  NaN NaN
9   2016-05-01  2016-04-01  NaN NaN
10  2016-06-01  2016-04-01  NaN NaN
11  2016-07-01  2016-04-01  NaN NaN
12  2016-06-01  2016-05-01  NaN NaN
13  2016-07-01  2016-05-01  NaN NaN
14  2016-08-01  2016-05-01  NaN NaN
15  2016-07-01  2016-06-01  NaN NaN
16  2016-08-01  2016-06-01  NaN NaN
17  2016-09-01  2016-06-01  NaN NaN
0   2016-02-01  2016-01-01  NaN NaN
1   2016-03-01  2016-01-01  NaN NaN
2   2016-04-01  2016-01-01  NaN NaN
3   2016-03-01  2016-02-01  NaN NaN
4   2016-04-01  2016-02-01  NaN NaN
5   2016-05-01  2016-02-01  NaN NaN
6   2016-04-01  2016-03-01  NaN NaN
7   2016-05-01  2016-03-01  NaN NaN
8   2016-06-01  2016-03-01  NaN NaN
9   2016-05-01  2016-04-01  NaN NaN
10  2016-06-01  2016-04-01  NaN NaN
11  2016-07-01  2016-04-01  NaN NaN
12  2016-06-01  2016-05-01  b   2.0
13  2016-07-01  2016-05-01  b   5.0
14  2016-08-01  2016-05-01  NaN NaN
15  2016-07-01  2016-06-01  b   20.0
16  2016-08-01  2016-06-01  b   11.0
17  2016-09-01  2016-06-01  b   1.0
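
Note that in the output above the rows with no match also have NaN in the user column, so it is no longer visible which user a filled-in row belongs to. A minimal sketch of one way to keep the label, assuming the dates are parsed as datetimes on both sides and that each (user, rd, fd) combination appears at most once; the names rd_range, pieces and df_filled are only illustrative:

```python
import pandas as pd

# Sample data from the question, with dates parsed up front.
x = pd.DataFrame({
    'user': ['a'] * 5 + ['b'] * 5,
    'rd': pd.to_datetime(['2016-01-01', '2016-01-01', '2016-02-01', '2016-02-01', '2016-02-01',
                          '2016-05-01', '2016-05-01', '2016-06-01', '2016-06-01', '2016-06-01']),
    'fd': pd.to_datetime(['2016-02-01', '2016-04-01', '2016-03-01', '2016-04-01', '2016-05-01',
                          '2016-06-01', '2016-07-01', '2016-08-01', '2016-07-01', '2016-09-01']),
    'val': [3, 4, 16, 7, 9, 2, 5, 11, 20, 1],
})

# Date grid: every month start in the rd range, paired with the next 3 months.
rd_range = pd.date_range(x['rd'].min(), x['rd'].max(), freq='MS')
left_df = pd.DataFrame(
    [(rd, rd + pd.DateOffset(months=k)) for rd in rd_range for k in (1, 2, 3)],
    columns=['rd', 'fd'],
)

pieces = []
# Grouping by the column name (not a one-element list) yields a scalar key.
for user, gr in x.groupby('user'):
    merged = left_df.merge(gr[['rd', 'fd', 'val']], on=['rd', 'fd'], how='left')
    merged.insert(0, 'user', user)  # label every row, matched or not
    pieces.append(merged)

df_filled = pd.concat(pieces, ignore_index=True)
```

Here the user column is dropped from the right side before the merge and written back from the group key afterwards, so only val is NaN for the filled-in rows.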

Unfortunately I don't think this is the quickest method, but I got what I wanted.
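
For what it's worth, the same result can probably be had without the per-user loop by building the full (user, rd, fd) MultiIndex once and reindexing against it, which is closer to what the question was originally trying to do. This is only a sketch and not benchmarked: it assumes the dates are datetimes and that (user, rd, fd) is unique in x (reindex raises on duplicate keys). The names full_index and df_full are illustrative.

```python
import pandas as pd

# Sample data from the question, with dates parsed up front.
x = pd.DataFrame({
    'user': ['a'] * 5 + ['b'] * 5,
    'rd': pd.to_datetime(['2016-01-01', '2016-01-01', '2016-02-01', '2016-02-01', '2016-02-01',
                          '2016-05-01', '2016-05-01', '2016-06-01', '2016-06-01', '2016-06-01']),
    'fd': pd.to_datetime(['2016-02-01', '2016-04-01', '2016-03-01', '2016-04-01', '2016-05-01',
                          '2016-06-01', '2016-07-01', '2016-08-01', '2016-07-01', '2016-09-01']),
    'val': [3, 4, 16, 7, 9, 2, 5, 11, 20, 1],
})

# Every month start between the first and last rd date.
rd_range = pd.date_range(x['rd'].min(), x['rd'].max(), freq='MS')

# All (user, rd, fd) combinations that should exist:
# each user x each rd month x the 3 months that follow it.
full_index = pd.MultiIndex.from_tuples(
    [(user, rd, rd + pd.DateOffset(months=k))
     for user in x['user'].unique()
     for rd in rd_range
     for k in (1, 2, 3)],
    names=['user', 'rd', 'fd'],
)

# Reindexing inserts the missing combinations with NaN in val.
df_full = x.set_index(['user', 'rd', 'fd']).reindex(full_index).reset_index()
```

Because user is part of the index, it stays filled in for every row and only val is NaN where a combination was missing.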
