Reputation: 3601
I have a dataframe that looks like this:
id  A    B    date
a   1    NaN  2016-01-01
a   NaN  6    2016-01-02
a   7    NaN  2016-10-01
b   3    5    2016-12-01
I'd like to collapse by date and id (each group goes to a single row, using data from the last date in the group, and filling missing data with the most recent data in the group). The groupby would roughly look like this:
df.groupby(['id', pd.Grouper(key='date', freq='30D')])
This would result in a dataframe like so:
id  A    B    date
a   1    6    2016-01-02
a   7    NaN  2016-10-01
b   3    5    2016-12-01
This is a very simple example. I believe my groupby would work, but I'm not sure how to combine each group into a single row, aggregating each column with its most recent value and forward filling where necessary (if a column is all NaN, the result should just be NaN).
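For reference, the example frame can be constructed like this (a minimal sketch; I'm assuming date should be parsed as datetime64):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'a', 'a', 'b'],
    'A': [1, np.nan, 7, 3],
    'B': [np.nan, 6, np.nan, 5],
    'date': pd.to_datetime(['2016-01-01', '2016-01-02',
                            '2016-10-01', '2016-12-01']),
})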
Upvotes: 0
Views: 629
Reputation: 2955
# Within each (id, 30-day bin) group, forward fill A and B, then keep the last row
df.groupby(['id', pd.Grouper(key='date', freq='30D')]).apply(lambda g: g[["A", "B"]].ffill().iloc[-1])
Result:
               A  B
id date
a  2016-01-01  1  6.0
   2016-09-27  7  NaN
b  2016-11-26  3  5.0
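Note that the date level of the index shows the start of each 30-day bin, not the last date within the group. If you'd rather have the keys back as ordinary columns, reset_index() flattens the result (a small usage note; res is just a hypothetical name for the frame above):
res = df.groupby(['id', pd.Grouper(key='date', freq='30D')]).apply(lambda g: g[["A", "B"]].ffill().iloc[-1])
res.reset_index()  # 'id' and 'date' become regular columns again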
EDIT: To address your comment, if you'd like to avoid the apply, you could also use two groupbys:
groupers = ['id', pd.Grouper(key='date', freq='30D')]
# First groupby forward fills within each group; second takes each group's last row
df.groupby(groupers).ffill().groupby(groupers).last()
Result:
               A  B
id date
a  2016-01-01  1  6
   2016-09-27  7  NaN
b  2016-11-26  3  5
It looks like performance is similar to the apply:
In [1]: groupers = ['id', pd.Grouper(key='date', freq='30D')]
In [2]: %%timeit
...: df.groupby(groupers).ffill().groupby(groupers).last()
...:
100 loops, best of 3: 9.79 ms per loop
In [3]: %%timeit
...: df.groupby(groupers).apply(lambda g: g[["A", "B"]].ffill().iloc[-1])
...:
100 loops, best of 3: 10.5 ms per loop
Upvotes: 2