Felix Gao
Felix Gao

Reputation: 43

pandas dataframe fill missing rows with date column

    uuid  site          ts  visit
0  +CW99  1124  2013-06-24     2
1  +CW99  1124  2013-06-26     1
2  +CW99  1124  2013-06-27     1
3  +CW99  1124  2013-06-20     1
4  +CW99  1124  2013-06-21     1
5  +CW99  1124  2013-06-24     2
6  +CW9W   956  2013-06-21     4
7  +CW9W   956  2013-06-22     2
8  +CW9W   956  2013-06-23     3
9  +CW9W   956  2013-06-24     4

I have a dataframe that is passed in looks like that. I don't really know how many distinct ts period are there. So I want to find the min and max of the ts then expands the rows that does not have all the ts (daily) to zero fill the visit column while copying the rest of the fields

        uuid  site          ts  visit
    0  +CW99  1124  2013-06-20     1
    1  +CW99  1124  2013-06-21     1
    2  +CW99  1124  2013-06-22     1
    3  +CW99  1124  2013-06-23     0
    4  +CW99  1124  2013-06-24     2
    5  +CW99  1124  2013-06-25     0
    6  +CW99  1124  2013-06-26     1
    7  +CW99  1124  2013-06-27     1
    8  +CW9W   956  2013-06-20     0
    9  +CW9W   956  2013-06-21     4
   10  +CW9W   956  2013-06-22     2
   11  +CW9W   956  2013-06-23     3
   12  +CW9W   956  2013-06-24     4
   13  +CW9W   956  2013-06-25     0
   14  +CW9W   956  2013-06-26     0
   15  +CW9W   956  2013-06-27     0

Thanks.

Upvotes: 4

Views: 750

Answers (1)

Jeff
Jeff

Reputation: 128918

This is non-trivial. I'll explain why below

Prelim, read in the original data frame and make sure that the ts column is a dtype of datetime64[ns]

# you may need to do this to get the correct dtype   
df['ts'] = df['ts'].to_datetime(df['ts'])

In [107]: df
Out[107]: 
    uuid  site                  ts  visit
0  +CW99  1124 2013-06-24 00:00:00      2
1  +CW99  1124 2013-06-26 00:00:00      1
2  +CW99  1124 2013-06-27 00:00:00      1
3  +CW99  1124 2013-06-20 00:00:00      1
4  +CW99  1124 2013-06-21 00:00:00      1
5  +CW99  1124 2013-06-24 00:00:00      2
6  +CW9W   956 2013-06-21 00:00:00      4
7  +CW9W   956 2013-06-22 00:00:00      2
8  +CW9W   956 2013-06-23 00:00:00      3
9  +CW9W   956 2013-06-24 00:00:00      4

In [106]: df.dtypes
Out[106]: 
uuid             object
site              int64
ts       datetime64[ns]
visit             int64
dtype: object

Create the master times between min and max

In [110]: all_ts = pd.date_range(df['ts'].min(),df['ts'].max())

In [111]: all_ts
Out[111]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-06-20 00:00:00, ..., 2013-06-27 00:00:00]
Length: 8, Freq: D, Timezone: None

Define a function like this

In [103]: def f(x):

              # we want all of the ``ts`` column that are not in the master time series
   .....:     adf = DataFrame(dict(ts = all_ts-Index(x['ts'])),columns=df.columns)

              # they should have visit of 0
   .....:     adf['visit'] = 0

              # first add them to the frame (x), ignoring the index
              # sort by the ts column
              # then fillforward missing values
   .....:     return x.append(adf,ignore_index=True).sort_index(by='ts').ffill()
   .....: 

Apply the function (you could group by uuid,site if you want as well)

In [116]: df.groupby('uuid').apply(f)
Out[116]: 
          uuid  site                  ts  visit
uuid                                           
+CW99 3  +CW99  1124 2013-06-20 00:00:00      1
      4  +CW99  1124 2013-06-21 00:00:00      1
      0  +CW99  1124 2013-06-24 00:00:00      2
      5  +CW99  1124 2013-06-24 00:00:00      2
      6  +CW99  1124 2013-06-25 00:00:00      0
      1  +CW99  1124 2013-06-26 00:00:00      1
      2  +CW99  1124 2013-06-27 00:00:00      1
+CW9W 0  +CW9W   956 2013-06-21 00:00:00      4
      1  +CW9W   956 2013-06-22 00:00:00      2
      2  +CW9W   956 2013-06-23 00:00:00      3
      3  +CW9W   956 2013-06-24 00:00:00      4
      4  +CW9W   956 2013-06-25 00:00:00      0

Note: you have a duplicate in the posted frame. Not sure if that was intential or not I preserved that. This is a somewhat easier problem if you don't have duplicates (in the ts column)

Here is the no dup way

In [207]: def f(x):
   .....:     x = x.set_index('ts').reindex(all_ts).reset_index()
   .....:     x['visit'] = x['visit'].fillna(0)
   .....:     return x.ffill()
   .....: 

In [208]: df_no_dups.groupby('uuid').apply(f)
Out[208]: 
                      index   uuid  site  visit
uuid                                           
+CW99 0 2013-06-20 00:00:00  +CW99  1124      1
      1 2013-06-21 00:00:00  +CW99  1124      1
      2 2013-06-22 00:00:00  +CW99  1124      0
      3 2013-06-23 00:00:00  +CW99  1124      0
      4 2013-06-24 00:00:00  +CW99  1124      2
      5 2013-06-25 00:00:00  +CW99  1124      0
      6 2013-06-26 00:00:00  +CW99  1124      1
      7 2013-06-27 00:00:00  +CW99  1124      1
+CW9W 0 2013-06-20 00:00:00    NaN   NaN      0
      1 2013-06-21 00:00:00  +CW9W   956      4
      2 2013-06-22 00:00:00  +CW9W   956      2
      3 2013-06-23 00:00:00  +CW9W   956      3
      4 2013-06-24 00:00:00  +CW9W   956      4
      5 2013-06-25 00:00:00  +CW9W   956      0
      6 2013-06-26 00:00:00  +CW9W   956      0
      7 2013-06-27 00:00:00  +CW9W   956      0

This forces all the elements to be there (note the NaN becuase no way to ffill on the first element). You could drop these if you want.

Upvotes: 4

Related Questions