user328349

Reputation: 11

Pandas groupby to expanding to list (numpy array)

I have a DataFrame that can be produced using this Python code:

import pandas as pd

df = pd.DataFrame({'visit': [1] * 6 + [2] * 6,
                   'time': [t for t in range(6)] * 2,
                   'observations': [o for o in range(12)]})

The following code enables me to reformat the data as desired:

dflist = []
for v_ in df.visit.unique():
    for t_ in df.time[df.visit == v_]:
        dflist.append([df[(df.visit == v_) & (df.time <= t_)].groupby('visit')['observations'].apply(list)])

pd.DataFrame(pd.concat([df[0] for df in dflist], axis=0))

However this is extremely slow.

I have tried using .expanding(); however, it only returns scalars, whereas I would like lists (or numpy arrays).
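To illustrate the limitation: built-in expanding aggregations collapse each window to a single scalar per row. A minimal sketch on the sample frame, using a running sum:

```python
import pandas as pd

df = pd.DataFrame({'visit': [1] * 6 + [2] * 6,
                   'time': list(range(6)) * 2,
                   'observations': list(range(12))})

# Each expanding window is reduced to one scalar per row
# (a cumulative sum within each visit group), not a list of values.
sums = df.groupby('visit')['observations'].expanding().sum()
```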

I would appreciate any help in vectorizing or otherwise optimizing this procedure.

Thanks

Upvotes: 1

Views: 228

Answers (2)

Henry Ecker

Reputation: 35636

Fortunately, in pandas 1.1.0 and newer, expanding produces an iterable of windows, which lets us take advantage of the faster grouping while producing non-scalar data like lists:

new_df = pd.DataFrame({
    'observations':
        [list(x) for x in df.groupby('visit')['observations'].expanding()]
}, index=df['visit'])

new_df:

               observations
visit                      
1                       [0]
1                    [0, 1]
1                 [0, 1, 2]
1              [0, 1, 2, 3]
1           [0, 1, 2, 3, 4]
1        [0, 1, 2, 3, 4, 5]
2                       [6]
2                    [6, 7]
2                 [6, 7, 8]
2              [6, 7, 8, 9]
2          [6, 7, 8, 9, 10]
2      [6, 7, 8, 9, 10, 11]
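Since the question also accepts numpy arrays, the same comprehension can collect arrays instead of lists: each iterated expanding window is a Series, so swapping `list(x)` for `x.to_numpy()` is a minor variant (assuming pandas 1.1.0+ as above):

```python
import pandas as pd

df = pd.DataFrame({'visit': [1] * 6 + [2] * 6,
                   'time': list(range(6)) * 2,
                   'observations': list(range(12))})

# Each expanding window x is a Series; to_numpy() keeps it as an array.
arr_df = pd.DataFrame({
    'observations':
        [x.to_numpy() for x in df.groupby('visit')['observations'].expanding()]
}, index=df['visit'])
```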

Timing via %timeit:

Setup:

import pandas as pd
df = pd.DataFrame({'visit': [1] * 6 + [2] * 6,
                   'time': [t for t in range(6)] * 2,
                   'observations': [o for o in range(12)]})

Original:

def fn():
    dflist = []
    for v_ in df.visit.unique():
        for t_ in df.time[df.visit == v_]:
            dflist.append([
                df[(df.visit == v_) & (df.time <= t_)]
                    .groupby('visit')['observations'].apply(list)
            ])

    return pd.DataFrame(pd.concat([df[0] for df in dflist], axis=0))

%timeit fn()
13 ms ± 692 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

List comprehension with expanding (~13x faster on this sample):

def fn2():
    return pd.DataFrame({
        'observations':
            [list(x) for x in df.groupby('visit')['observations'].expanding()]
    }, index=df['visit'])

%timeit fn2()
967 µs ± 57.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Sanity Check:

fn().eq(fn2()).all(axis=None)  # True

The double-apply approach by @Quixotic22 (~3.4x faster than the original, ~3.9x slower than comprehension + expanding):

def fn3():
    return (df.
            set_index('visit')['observations'].
            apply(lambda x: [x]).
            reset_index().groupby('visit')['observations'].
            apply(lambda x: x.cumsum()))

%timeit fn3()
3.78 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

*Note: this approach only produces a Series of observations; it does not include the visit as the index.

fn().eq(fn3()).all(axis=None)  # False

Upvotes: 2

Quixotic22

Reputation: 2924

Looks like a good solution has been provided, but dropping this here as a viable alternative.

(df.
 set_index('visit')['observations'].
 apply(lambda x: [x]).
 reset_index().groupby('visit')['observations'].
 apply(lambda x: x.cumsum())
 )
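The trick here is that `cumsum` over a Series of one-element lists concatenates them, since `list + list` is concatenation in Python. A quick sketch rebuilding the sample frame to show the accumulated values match the expanding output:

```python
import pandas as pd

df = pd.DataFrame({'visit': [1] * 6 + [2] * 6,
                   'time': list(range(6)) * 2,
                   'observations': list(range(12))})

# Wrap each value in a one-element list; cumsum then concatenates lists
# within each visit group (list + list == concatenation).
out = (df
       .set_index('visit')['observations']
       .apply(lambda x: [x])
       .reset_index()
       .groupby('visit')['observations']
       .apply(lambda x: x.cumsum()))
```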

Upvotes: 1
