Reputation: 6748
What I have in a dataframe:
email user_name sessions ymo
[email protected] JD 1 2015-03-01
[email protected] JD 2 2015-05-01
What I need:
email user_name sessions ymo
[email protected] JD 0 2015-01-01
[email protected] JD 0 2015-02-01
[email protected] JD 1 2015-03-01
[email protected] JD 0 2015-04-01
[email protected] JD 2 2015-05-01
[email protected] JD 0 2015-06-01
[email protected] JD 0 2015-07-01
[email protected] JD 0 2015-08-01
[email protected] JD 0 2015-09-01
[email protected] JD 0 2015-10-01
[email protected] JD 0 2015-11-01
[email protected] JD 0 2015-12-01
ymo
column are pd.Timestamp
s:
all_ymo
[Timestamp('2015-01-01 00:00:00'),
Timestamp('2015-02-01 00:00:00'),
Timestamp('2015-03-01 00:00:00'),
Timestamp('2015-04-01 00:00:00'),
Timestamp('2015-05-01 00:00:00'),
Timestamp('2015-06-01 00:00:00'),
Timestamp('2015-07-01 00:00:00'),
Timestamp('2015-08-01 00:00:00'),
Timestamp('2015-09-01 00:00:00'),
Timestamp('2015-10-01 00:00:00'),
Timestamp('2015-11-01 00:00:00'),
Timestamp('2015-12-01 00:00:00')]
Unfortunately, this answer: Adding values for missing data combinations in Pandas is not good as it creates duplicates for existing ymo
values.
I tried something like this, but it is extremely slow:
for em in all_emails:
existent_ymo = fill_ymo[fill_ymo['email'] == em]['ymo']
existent_ymo = set([pd.Timestamp(datetime.date(t.year, t.month, t.day)) for t in existent_ymo])
missing_ymo = list(existent_ymo - all_ymo)
multi_ind = pd.MultiIndex.from_product([[em], missing_ymo], names=col_names)
fill_ymo = sessions.set_index(col_names).reindex(multi_ind, fill_value=0).reset_index()
Upvotes: 1
Views: 101
Reputation: 862396
I try create more general solution with periods
:
print (df)
email user_name sessions ymo
0 [email protected] JD 1 2015-03-01
1 [email protected] JD 2 2015-05-01
2 [email protected] AB 1 2015-03-01
3 [email protected] AB 2 2015-05-01
mbeg = pd.period_range('2015-01', periods=12, freq='M')
print (mbeg)
PeriodIndex(['2015-01', '2015-02', '2015-03', '2015-04', '2015-05', '2015-06',
'2015-07', '2015-08', '2015-09', '2015-10', '2015-11', '2015-12'],
dtype='int64', freq='M')
#convert column ymo to period
df.ymo = df.ymo.dt.to_period('m')
#groupby and reindex with filling 0
df = df.groupby(['email','user_name'])
.apply(lambda x: x.set_index('ymo')
.reindex(mbeg, fill_value=0)
.drop(['email','user_name'], axis=1))
.rename_axis(('email','user_name','ymo'))
.reset_index()
print (df)
email user_name ymo sessions
0 [email protected] JD 2015-01 0
1 [email protected] JD 2015-02 0
2 [email protected] JD 2015-03 1
3 [email protected] JD 2015-04 0
4 [email protected] JD 2015-05 2
5 [email protected] JD 2015-06 0
6 [email protected] JD 2015-07 0
7 [email protected] JD 2015-08 0
8 [email protected] JD 2015-09 0
9 [email protected] JD 2015-10 0
10 [email protected] JD 2015-11 0
11 [email protected] JD 2015-12 0
12 [email protected] AB 2015-01 0
13 [email protected] AB 2015-02 0
14 [email protected] AB 2015-03 1
15 [email protected] AB 2015-04 0
16 [email protected] AB 2015-05 2
17 [email protected] AB 2015-06 0
18 [email protected] AB 2015-07 0
19 [email protected] AB 2015-08 0
20 [email protected] AB 2015-09 0
21 [email protected] AB 2015-10 0
22 [email protected] AB 2015-11 0
23 [email protected] AB 2015-12 0
Then if need datetimes
use to_timestamp
:
df.ymo = df.ymo.dt.to_timestamp()
print (df)
email user_name ymo sessions
0 [email protected] JD 2015-01-01 0
1 [email protected] JD 2015-02-01 0
2 [email protected] JD 2015-03-01 1
3 [email protected] JD 2015-04-01 0
4 [email protected] JD 2015-05-01 2
5 [email protected] JD 2015-06-01 0
6 [email protected] JD 2015-07-01 0
7 [email protected] JD 2015-08-01 0
8 [email protected] JD 2015-09-01 0
9 [email protected] JD 2015-10-01 0
10 [email protected] JD 2015-11-01 0
11 [email protected] JD 2015-12-01 0
12 [email protected] AB 2015-01-01 0
13 [email protected] AB 2015-02-01 0
14 [email protected] AB 2015-03-01 1
15 [email protected] AB 2015-04-01 0
16 [email protected] AB 2015-05-01 2
17 [email protected] AB 2015-06-01 0
18 [email protected] AB 2015-07-01 0
19 [email protected] AB 2015-08-01 0
20 [email protected] AB 2015-09-01 0
21 [email protected] AB 2015-10-01 0
22 [email protected] AB 2015-11-01 0
23 [email protected] AB 2015-12-01 0
Solution with datetimes:
print (df)
email user_name sessions ymo
0 [email protected] JD 1 2015-03-01
1 [email protected] JD 2 2015-05-01
2 [email protected] AB 1 2015-03-01
3 [email protected] AB 2 2015-05-01
mbeg = pd.date_range('2015-01-31', periods=12, freq='M') - pd.offsets.MonthBegin()
df = df.groupby(['email','user_name'])
.apply(lambda x: x.set_index('ymo')
.reindex(mbeg, fill_value=0)
.drop(['email','user_name'], axis=1))
.rename_axis(('email','user_name','ymo'))
.reset_index()
print (df)
email user_name ymo sessions
0 [email protected] JD 2015-01-01 0
1 [email protected] JD 2015-02-01 0
2 [email protected] JD 2015-03-01 1
3 [email protected] JD 2015-04-01 0
4 [email protected] JD 2015-05-01 2
5 [email protected] JD 2015-06-01 0
6 [email protected] JD 2015-07-01 0
7 [email protected] JD 2015-08-01 0
8 [email protected] JD 2015-09-01 0
9 [email protected] JD 2015-10-01 0
10 [email protected] JD 2015-11-01 0
11 [email protected] JD 2015-12-01 0
12 [email protected] AB 2015-01-01 0
13 [email protected] AB 2015-02-01 0
14 [email protected] AB 2015-03-01 1
15 [email protected] AB 2015-04-01 0
16 [email protected] AB 2015-05-01 2
17 [email protected] AB 2015-06-01 0
18 [email protected] AB 2015-07-01 0
19 [email protected] AB 2015-08-01 0
20 [email protected] AB 2015-09-01 0
21 [email protected] AB 2015-10-01 0
22 [email protected] AB 2015-11-01 0
23 [email protected] AB 2015-12-01 0
Upvotes: 3
Reputation: 294198
reindex
ffill
and bfill
columns ['email', 'user_name']
fillna(0)
for column 'sessions'
mbeg = pd.date_range('2015-01-31', periods=12, freq='M') - pd.offsets.MonthBegin()
df1 = df.set_index('ymo').reindex(mbeg)
df1[['email', 'user_name']] = df1[['email', 'user_name']].ffill().bfill()
df1['sessions'] = df1['sessions'].fillna(0).astype(int)
df1
Upvotes: 2