Reputation: 341
Similar question to this one, but with some modifications:
Instead of filling in missing dates for each group between the min and max date of the entire column, we should only fill in the dates between the min and max of that group, and the output should keep just the last row for each date within a group.
Reproducible example:
import pandas as pd

x = pd.DataFrame({'dt': ['2016-01-01', '2016-01-03', '2016-01-04', '2016-01-01', '2016-01-01', '2016-01-04'],
                  'amount': [10.0, 30.0, 40.0, 78.0, 80.0, 82.0],
                  'sub_id': [1, 1, 1, 2, 2, 2]})
Visually:
          dt  sub_id  amount
0 2016-01-01       1    10.0
1 2016-01-03       1    30.0
2 2016-01-04       1    40.0
3 2016-01-01       2    78.0
4 2016-01-01       2    80.0
5 2016-01-04       2    82.0
Output I need:
          dt  sub_id  amount
0 2016-01-01       1    10.0
1 2016-01-02       1    10.0
2 2016-01-03       1    30.0
3 2016-01-04       1    40.0
4 2016-01-01       2    80.0
5 2016-01-02       2    80.0
6 2016-01-03       2    80.0
7 2016-01-04       2    82.0
We are grouping by dt and sub_id. As you can see, for sub_id=1 a row was added for 2016-01-02 and its amount was imputed as 10.0, since the previous row's amount was 10.0 (assume the data is sorted beforehand to enable this). For sub_id=2, rows were added for 2016-01-02 and 2016-01-03 with amount 80.0, as that was the last amount before those dates. The first row for 2016-01-01 (amount 78.0) was also dropped, because we only want to keep the last row for each date and sub_id.
I'm looking for the most efficient way to do this, as the real data has millions of rows. My current method applies lambda functions across groups of sub_id, roughly along the lines of the sketch below, but I feel like we could do better.
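A simplified sketch of that kind of approach (illustrative only, not the exact code; assumes x as defined above):
# keep the last row per (dt, sub_id), then reindex each group onto its own
# daily date range and forward-fill
x['dt'] = pd.to_datetime(x['dt'])
current = (
    x.drop_duplicates(['dt', 'sub_id'], keep='last')
     .groupby('sub_id', group_keys=False)
     .apply(lambda g: g.set_index('dt')
                       .reindex(pd.date_range(g['dt'].min(), g['dt'].max(), freq='D'))
                       .ffill())
     .rename_axis('dt')
     .reset_index()
)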
Thanks!
Upvotes: 4
Views: 3671
Reputation: 294218
Getting the date right, of course:
x.dt = pd.to_datetime(x.dt)
Then this:
cols = ['dt', 'sub_id']
pd.concat([
    d.asfreq('D').ffill(downcast='infer')
    for _, d in x.drop_duplicates(cols, keep='last')
                 .set_index('dt')
                 .groupby('sub_id')
]).reset_index()
          dt  amount  sub_id
0 2016-01-01      10       1
1 2016-01-02      10       1
2 2016-01-03      30       1
3 2016-01-04      40       1
4 2016-01-01      80       2
5 2016-01-02      80       2
6 2016-01-03      80       2
7 2016-01-04      82       2
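If ffill(downcast='infer') complains on newer pandas (the downcast keyword has since been deprecated, as far as I know), the same idea without it, as a sketch:
cols = ['dt', 'sub_id']
out = pd.concat([
    d.asfreq('D').ffill()
    for _, d in x.drop_duplicates(cols, keep='last')
                 .set_index('dt')
                 .groupby('sub_id')
]).reset_index()
# optional: cast whole-number amounts to int to mimic downcast='infer'
out['amount'] = out['amount'].astype(int)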
Upvotes: 4
Reputation: 28233
Use asfreq & groupby: first convert dt to datetime & get rid of duplicates; then, for each group of sub_id, use asfreq('D', method='ffill') to generate the missing dates and impute the amounts; finally reset_index on the amount column, since sub_id appears both as a column and in the index.
x.dt = pd.to_datetime(x.dt)
x.drop_duplicates(
    ['dt', 'sub_id'], keep='last'
).groupby('sub_id').apply(
    lambda x: x.set_index('dt').asfreq('D', method='ffill')
).amount.reset_index()
# output:
   sub_id         dt  amount
0       1 2016-01-01    10.0
1       1 2016-01-02    10.0
2       1 2016-01-03    30.0
3       1 2016-01-04    40.0
4       2 2016-01-01    80.0
5       2 2016-01-02    80.0
6       2 2016-01-03    80.0
7       2 2016-01-04    82.0
Upvotes: 2
Reputation: 1352
The below works for me and seems pretty efficient, but I can't say if it's efficient enough. It does avoid lambdas, though.
I called your data df.
Create a base_df with the entire date / sub_id grid:
import pandas as pd
from itertools import product

base_grid = product(
    pd.date_range(df['dt'].min(), df['dt'].max(), freq='D'),
    list(range(df['sub_id'].min(), df['sub_id'].max() + 1, 1))
)
base_df = pd.DataFrame(list(base_grid), columns=['dt', 'sub_id'])
Get the max value per dt / sub_id from df:
max_value_df = df.loc[df.groupby(['dt', 'sub_id'])['amount'].idxmax()]
max_value_df['dt'] = max_value_df['dt'].apply(pd.Timestamp)
Merge base_df on the max values:
merged_df = base_df.merge(max_value_df, how='left', on=['dt', 'sub_id'])
Sort and forward fill the maximal value:
merged_df = merged_df.sort_values(by=['sub_id', 'dt', 'amount'], ascending=True)
merged_df['amount'] = merged_df.groupby(['sub_id'])['amount'].ffill()
Result:
           dt  sub_id  amount
0  2016-01-01       1    10.0
2  2016-01-02       1    10.0
4  2016-01-03       1    30.0
6  2016-01-04       1    40.0
1  2016-01-01       2    80.0
3  2016-01-02       2    80.0
5  2016-01-03       2    80.0
7  2016-01-04       2    82.0
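Note that base_grid spans the full date range of the dt column; in this example every sub_id happens to cover the same range, so it makes no difference. If groups have different min/max dates and only each group's own range is wanted (as in the question), one possible trim after the fill, sketched here and assuming df['dt'] has already been converted with pd.to_datetime:
# per-group date ranges (gmin/gmax are helper names introduced for this sketch)
group_range = df.groupby('sub_id')['dt'].agg(gmin='min', gmax='max')
merged_df = merged_df.merge(group_range, left_on='sub_id', right_index=True)
# keep only the dates inside each sub_id's own observed range
merged_df = merged_df[(merged_df['dt'] >= merged_df['gmin']) &
                      (merged_df['dt'] <= merged_df['gmax'])]
merged_df = merged_df.drop(columns=['gmin', 'gmax'])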
Upvotes: 0
Reputation: 323226
By using resample with groupby:
x.dt = pd.to_datetime(x.dt)
x.set_index('dt').groupby('sub_id').apply(
    lambda x: x.resample('D').max().ffill()
).reset_index(level=1)
Out[265]:
                dt  amount  sub_id
sub_id
1       2016-01-01    10.0     1.0
1       2016-01-02    10.0     1.0
1       2016-01-03    30.0     1.0
1       2016-01-04    40.0     1.0
2       2016-01-01    80.0     2.0
2       2016-01-02    80.0     2.0
2       2016-01-03    80.0     2.0
2       2016-01-04    82.0     2.0
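To avoid sub_id showing up both as the index and as a float column, a variant of the same idea that selects amount first and uses last() (matching the keep-the-last-row requirement; max() gives the same result on this data), as a sketch:
(x.set_index('dt')
  .groupby('sub_id')['amount']
  .apply(lambda s: s.resample('D').last().ffill())
  .reset_index())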
Upvotes: 4