hedebyhedge
hedebyhedge

Reputation: 455

How to fill in arbitrary missing dates in Pandas dataframe?

My current dataframe:

   |   user  |       date        | count|
   | ------  | ------------------| -----|
   | user 1  | 2017-11-09        |   3  |
   | user 1  | 2017-11-12        |   4  |
   | user 1  | 2017-11-14        |   5  |
   | user 2  | 2017-11-11        |   2  |
   | user 2  | 2017-11-12        |   6  |
   | user 2  | 2017-11-14        |   7  |

What I would like it to be:

   |   user  |       date        | count|
   | ------  | ------------------| -----|
   | user 1  | 2017-11-09        |   3  |
   | user 1  | 2017-11-10        |   0  |
   | user 1  | 2017-11-11        |   0  |
   | user 1  | 2017-11-12        |   4  |
   | user 1  | 2017-11-13        |   0  | 
   | user 1  | 2017-11-14        |   5  |
   | user 2  | 2017-11-09        |   0  |
   | user 2  | 2017-11-10        |   0  |
   | user 2  | 2017-11-11        |   2  |
   | user 2  | 2017-11-12        |   6  |
   | user 2  | 2017-11-13        |   0  | 
   | user 2  | 2017-11-14        |   7  |

Any suggestions? I've tried doing the resample method, but I'm not sure that it works. Also, my dates are already converted to datetime objects.

EDIT: I'm thankful for the answers, but I've realized that my date column has multiple of the same dates, and indexing it is now an issue, because there are duplicate dates. Is there a way to approach the problem that doesn't require indexing (or gets around it?)

Upvotes: 1

Views: 85

Answers (3)

jezrael
jezrael

Reputation: 862406

Use asfreq working with DatetimeIndex, so first set_index:

#if necessary convert to datetimes
#df['date'] = pd.to_datetime(df['date'])

df = df.set_index('date').asfreq('d', fill_value=0)
print (df)
            count
date             
2017-11-09      3
2017-11-10      0
2017-11-11      0
2017-11-12      4
2017-11-13      0
2017-11-14      5

If there are duplicated DatetimeIndex, is necessary resample with some aggregate function like sum:

df = df.set_index('date').resample('d').sum()

EDIT if dont need all combinations of dates:

df = (df.set_index('date').groupby('user')['count']
        .apply(lambda x: x.asfreq('d', fill_value=0))
        .reset_index())
print (df)
     user       date  count
0  user 1 2017-11-09      3
1  user 1 2017-11-10      0
2  user 1 2017-11-11      0
3  user 1 2017-11-12      4
4  user 1 2017-11-13      0
5  user 1 2017-11-14      5
6  user 2 2017-11-11      2
7  user 2 2017-11-12      6
8  user 2 2017-11-13      0
9  user 2 2017-11-14      7

EDIT1 For all combinations of dates:

mux = pd.MultiIndex.from_product([df['user'].unique(), pd.date_range(df['date'].min(), df['date'].max())],
                                 names=['user','date'])
df = df.set_index(['user', 'date']).reindex(mux, fill_value=0).reset_index()
print (df)
      user       date  count
0   user 1 2017-11-09      3
1   user 1 2017-11-10      0
2   user 1 2017-11-11      0
3   user 1 2017-11-12      4
4   user 1 2017-11-13      0
5   user 1 2017-11-14      5
6   user 2 2017-11-09      0
7   user 2 2017-11-10      0
8   user 2 2017-11-11      2
9   user 2 2017-11-12      6
10  user 2 2017-11-13      0
11  user 2 2017-11-14      7

Upvotes: 4

Bubble Bubble Bubble Gut
Bubble Bubble Bubble Gut

Reputation: 3358

@jezrael's answer is really awesome! Just to add a bit if you want to use an arbitrary range of date:

more_dates = pd.date_range('20171101', '20171120')
df = df.reindex(more_dates, fill_value=0)

            count
2017-11-01    0.0
2017-11-02    0.0
2017-11-03    0.0
2017-11-04    0.0
2017-11-05    0.0
2017-11-06    0.0
2017-11-07    0.0
2017-11-08    0.0
2017-11-09    3.0
2017-11-10    0.0
2017-11-11    0.0
2017-11-12    4.0
2017-11-13    0.0
2017-11-14    5.0
2017-11-15    0.0
2017-11-16    0.0
2017-11-17    0.0
2017-11-18    0.0
2017-11-19    0.0
2017-11-20    0.0

Thanks @jezrael for the advice, edited.

Upvotes: 1

rafaelc
rafaelc

Reputation: 59264

You can create a date_range and use merge

Example:

>>> pd.date_range(start=df.date.min(), end=df.date.max(), freq='1D')

DatetimeIndex(['2017-11-09', '2017-11-10', '2017-11-11', '2017-11-12',
               '2017-11-13', '2017-11-14'],
              dtype='datetime64[ns]', freq='D')

Then

df2 = pd.DataFrame((pd.date_range(start=df.date.min(), end=df.date.max(), freq='1D')), columns=["date"])
pd.merge(df2,df, on="date", how="left").fillna(0)

date    count
0   2017-11-09  3.0
1   2017-11-10  0.0
2   2017-11-11  0.0
3   2017-11-12  4.0
4   2017-11-13  0.0
5   2017-11-14  5.0

Upvotes: 2

Related Questions