Reputation: 3601
I have a dataframe that looks like this:
id  A    B    date
a   1    NaN  2016-01-01
a   NaN  6    2016-01-02
a   7    NaN  2016-10-01
b   3    5    2016-12-01
I'd like to collapse by date and id (each group goes to a single row, using data from the last date in the group, and filling missing data with the most recent data in the group). The groupby would roughly look like this:
df.groupby(['id', pd.Grouper(key='date', freq='30D')])
This would result in a dataframe like so:
id  A    B    date
a   1    6    2016-01-02
a   7    NaN  2016-10-01
b   3    5    2016-12-01
This is a very simple example. I believe my groupby would work, but I'm not sure how to combine each group into a single row, aggregating each column with its most recent value and forward filling where necessary (if a column is all NaN, the result should just be NaN).
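For reference, the example frame can be constructed like this (a minimal sketch; I'm assuming date should be parsed as datetime64):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'a', 'a', 'b'],
    'A': [1, np.nan, 7, 3],
    'B': [np.nan, 6, np.nan, 5],
    'date': pd.to_datetime(['2016-01-01', '2016-01-02',
                            '2016-10-01', '2016-12-01']),
})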
Upvotes: 0
Views: 629
Reputation: 2955
# Within each (id, 30-day bin) group, forward fill A and B, then keep the last row
df.groupby(['id', pd.Grouper(key='date', freq='30D')]).apply(lambda g: g[["A", "B"]].ffill().iloc[-1])
Result:
               A  B
id date
a  2016-01-01  1  6.0
   2016-09-27  7  NaN
b  2016-11-26  3  5.0
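Note that the date level of the index shows the start of each 30-day bin, not the last date within the group. If you'd rather have the keys back as ordinary columns, reset_index() flattens the result (a small usage note; res is just a hypothetical name for the frame above):
res = df.groupby(['id', pd.Grouper(key='date', freq='30D')]).apply(lambda g: g[["A", "B"]].ffill().iloc[-1])
res.reset_index()  # 'id' and 'date' become regular columns again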
EDIT: To address your comment, if you'd like to avoid the apply, you could also use two groupbys:
groupers = ['id', pd.Grouper(key='date', freq='30D')]
# First groupby forward fills within each group; second takes each group's last row
df.groupby(groupers).ffill().groupby(groupers).last()
Result:
               A  B
id date
a  2016-01-01  1  6
   2016-09-27  7  NaN
b  2016-11-26  3  5
It looks like performance is similar to the apply:
In [1]: groupers = ['id', pd.Grouper(key='date', freq='30D')]
In [2]: %%timeit
...: df.groupby(groupers).ffill().groupby(groupers).last()
...:
100 loops, best of 3: 9.79 ms per loop
In [3]: %%timeit
...: df.groupby(groupers).apply(lambda g: g[["A", "B"]].ffill().iloc[-1])
...:
100 loops, best of 3: 10.5 ms per loop
Upvotes: 2