Reputation: 20155
My goal is to fill in the missing date entries per project_id, with 0 in the data column.
For example:
df = pd.DataFrame({
    'project_id': ['A', 'A', 'A', 'B', 'B'],
    'timestamp': ['2018-01-01', '2018-03-01', '2018-04-01', '2018-03-01', '2018-06-01'],
    'data': [100, 28, 45, 64, 55]})
which is
project_id timestamp data
0 A 2018-01-01 100
1 A 2018-03-01 28
2 A 2018-04-01 45
3 B 2018-03-01 64
4 B 2018-06-01 55
should become
project_id timestamp data
0 A 2018-01-01 100
1 A 2018-02-01 0
2 A 2018-03-01 28
3 A 2018-04-01 45
4 B 2018-03-01 64
5 B 2018-04-01 0
6 B 2018-05-01 0
7 B 2018-06-01 55
where indices 1, 5, and 6 are added.
My current approach:
df.groupby('project_id').apply(lambda x: x[['timestamp', 'data']].set_index('timestamp').asfreq('M', how='start', fill_value=0))
is obviously wrong, because it sets everything to 0 and resamples to the last day of each month rather than the first, although I thought how='start' would handle that.
How do I fill in missing datetime entries after a groupby to get a continuous time series for each group?
Upvotes: 1
Views: 155
Reputation: 42916
You can use groupby in combination with pandas.Grouper:
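# Grouper(freq=...) needs a DatetimeIndex, so parse the strings first
df['timestamp'] = pd.to_datetime(df['timestamp'])
df_new = pd.concat([
    # bin each project by month start; months without rows come back as NaN
    d.groupby(pd.Grouper(freq='MS')).first().assign(project_id=n).fillna(0)
    for n, d in df.set_index('timestamp').groupby('project_id')
])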
df_new = df_new.sort_values('project_id').reset_index()
Output
print(df_new)
timestamp project_id data
0 2018-01-01 A 100.0
1 2018-02-01 A 0.0
2 2018-03-01 A 28.0
3 2018-04-01 A 45.0
4 2018-03-01 B 64.0
5 2018-04-01 B 0.0
6 2018-05-01 B 0.0
7 2018-06-01 B 55.0
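For comparison, here is a minimal self-contained sketch of the same per-project fill done with an explicit date_range and reindex instead of Grouper. This is a swapped-in technique, not the code from this answer, and the helper name fill_months is hypothetical. It assumes every timestamp already falls on the first day of a month.

```python
import pandas as pd

df = pd.DataFrame({
    'project_id': ['A', 'A', 'A', 'B', 'B'],
    'timestamp': ['2018-01-01', '2018-03-01', '2018-04-01', '2018-03-01', '2018-06-01'],
    'data': [100, 28, 45, 64, 55]})
df['timestamp'] = pd.to_datetime(df['timestamp'])


def fill_months(group):
    """Hypothetical helper: put one project's rows on a continuous month-start range."""
    full_range = pd.date_range(group['timestamp'].min(),
                               group['timestamp'].max(), freq='MS')
    return (group.set_index('timestamp')['data']
                 .reindex(full_range, fill_value=0)  # missing months get 0
                 .rename_axis('timestamp')
                 .reset_index())


# Fill each project separately, then stitch the pieces back together.
parts = [fill_months(g).assign(project_id=pid) for pid, g in df.groupby('project_id')]
result = pd.concat(parts, ignore_index=True)[['project_id', 'timestamp', 'data']]
print(result)
```

Because reindex is given fill_value=0 directly, the data column stays integer here rather than being upcast to float by intermediate NaN values.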
Upvotes: 2
Reputation: 150745
You are close:
df.timestamp = pd.to_datetime(df.timestamp)
# notice 'MS' (month start) instead of 'M' (month end)
new_df = df.groupby('project_id').apply(lambda x: x[['timestamp', 'data']]
                                        .set_index('timestamp').asfreq('MS'))
new_df.data = df.set_index(['project_id', 'timestamp']).data
df = new_df.fillna(0).reset_index()
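To see why the original attempt produced only zeros, the sketch below compares a month-end and a month-start frequency on project A's rows. It uses the offset objects pd.offsets.MonthEnd() and pd.offsets.MonthBegin(), which correspond to the 'M' and 'MS' aliases, to avoid depending on how a given pandas version treats the 'M' alias. The variable names are illustrative only. As far as the pandas documentation describes it, the how argument of asfreq applies to a PeriodIndex only, which would explain why how='start' had no effect on a DatetimeIndex.

```python
import pandas as pd

# Project A's rows from the question, indexed by timestamp.
s = pd.Series(
    [100, 28, 45],
    index=pd.to_datetime(['2018-01-01', '2018-03-01', '2018-04-01']),
    name='data',
)

# Month-end grid: 01-31, 02-28, 03-31. None of the original dates lie on it,
# so every row is filled with fill_value.
month_end = s.asfreq(pd.offsets.MonthEnd(), fill_value=0)

# Month-start grid: 01-01 .. 04-01. The original dates match, only February is new.
month_start = s.asfreq(pd.offsets.MonthBegin(), fill_value=0)

print(month_end)
print(month_start)
```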
Upvotes: 3