Reputation: 4846
I have a data frame that looks like the following:

import pandas as pd

x = pd.DataFrame({'user': ['a', 'a', 'b', 'b'],
                  'dt': ['2016-01-01', '2016-01-02', '2016-01-05', '2016-01-06'],
                  'val': [1, 33, 2, 1]})

What I would like to be able to do is find the minimum and maximum date within the dt column, expand that column so that every date in between is present, and fill in 0 for the val column on the new rows. So the desired output is
dt user val
0 2016-01-01 a 1
1 2016-01-02 a 33
2 2016-01-03 a 0
3 2016-01-04 a 0
4 2016-01-05 a 0
5 2016-01-06 a 0
6 2016-01-01 b 0
7 2016-01-02 b 0
8 2016-01-03 b 0
9 2016-01-04 b 0
10 2016-01-05 b 2
11 2016-01-06 b 1
I've tried the solutions mentioned here and here, but they aren't what I'm after. Any pointers much appreciated.
Upvotes: 48
Views: 33228
Reputation: 28729
An old question with already excellent answers; this is an alternative using the complete function from pyjanitor, which can abstract away the steps of explicitly generating the missing rows:
# pip install pyjanitor
import pandas as pd
import janitor  # registers the complete method on pandas DataFrames
x['dt'] = pd.to_datetime(x['dt'])
# generate complete list of dates
dates = dict(dt = pd.date_range(x.dt.min(), x.dt.max(), freq='1D'))
# build the new dataframe, and fill nulls with 0
x.complete('user', dates, fill_value = 0)
user dt val
0 a 2016-01-01 1
1 a 2016-01-02 33
2 a 2016-01-03 0
3 a 2016-01-04 0
4 a 2016-01-05 0
5 a 2016-01-06 0
6 b 2016-01-01 0
7 b 2016-01-02 0
8 b 2016-01-03 0
9 b 2016-01-04 0
10 b 2016-01-05 2
11 b 2016-01-06 1
Upvotes: 2
Reputation:
Initial Dataframe:
dt user val
0 2016-01-01 a 1
1 2016-01-02 a 33
2 2016-01-05 b 2
3 2016-01-06 b 1
First, convert the dates to datetime:
x['dt'] = pd.to_datetime(x['dt'])
Then, generate the dates and unique users:
dates = x.set_index('dt').resample('D').asfreq().index
>> DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
'2016-01-05', '2016-01-06'],
dtype='datetime64[ns]', name='dt', freq='D')
users = x['user'].unique()
>> array(['a', 'b'], dtype=object)
This will allow you to create a MultiIndex:
idx = pd.MultiIndex.from_product((dates, users), names=['dt', 'user'])
>> MultiIndex(levels=[[2016-01-01 00:00:00, 2016-01-02 00:00:00, 2016-01-03 00:00:00, 2016-01-04 00:00:00, 2016-01-05 00:00:00, 2016-01-06 00:00:00], ['a', 'b']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
names=['dt', 'user'])
You can use that to reindex your DataFrame:
x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index()
Out:
dt user val
0 2016-01-01 a 1
1 2016-01-01 b 0
2 2016-01-02 a 33
3 2016-01-02 b 0
4 2016-01-03 a 0
5 2016-01-03 b 0
6 2016-01-04 a 0
7 2016-01-04 b 0
8 2016-01-05 a 0
9 2016-01-05 b 2
10 2016-01-06 a 0
11 2016-01-06 b 1
which can then be sorted by user:
x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index().sort_values(by='user')
Out:
dt user val
0 2016-01-01 a 1
2 2016-01-02 a 33
4 2016-01-03 a 0
6 2016-01-04 a 0
8 2016-01-05 a 0
10 2016-01-06 a 0
1 2016-01-01 b 0
3 2016-01-02 b 0
5 2016-01-03 b 0
7 2016-01-04 b 0
9 2016-01-05 b 2
11 2016-01-06 b 1
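If you prefer not to go through resample, the same daily index can be built directly with pd.date_range. A minimal self-contained sketch of the full pipeline under that variation:

```python
import pandas as pd

x = pd.DataFrame({'user': ['a', 'a', 'b', 'b'],
                  'dt': pd.to_datetime(['2016-01-01', '2016-01-02',
                                        '2016-01-05', '2016-01-06']),
                  'val': [1, 33, 2, 1]})

# daily index spanning the observed min/max dates
dates = pd.date_range(x['dt'].min(), x['dt'].max(), freq='D', name='dt')

# cartesian product of dates and users, then reindex and fill with 0
idx = pd.MultiIndex.from_product([dates, x['user'].unique()],
                                 names=['dt', 'user'])
out = (x.set_index(['dt', 'user'])
        .reindex(idx, fill_value=0)
        .reset_index()
        .sort_values('user'))
```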
Upvotes: 49
Reputation: 294506
As @ayhan suggests
x.dt = pd.to_datetime(x.dt)
One-liner using mostly @ayhan's ideas while incorporating stack/unstack and fill_value:
(x.set_index(['dt', 'user'])
   .unstack(fill_value=0)
   .asfreq('D', fill_value=0)
   .stack()
   .sort_index(level=1)
   .reset_index())
dt user val
0 2016-01-01 a 1
1 2016-01-02 a 33
2 2016-01-03 a 0
3 2016-01-04 a 0
4 2016-01-05 a 0
5 2016-01-06 a 0
6 2016-01-01 b 0
7 2016-01-02 b 0
8 2016-01-03 b 0
9 2016-01-04 b 0
10 2016-01-05 b 2
11 2016-01-06 b 1
Upvotes: 46