Reputation: 11
I have a DataFrame that can be produced using this Python code:
import pandas as pd
df = pd.DataFrame({'visit': [1] * 6 + [2] * 6,
                   'time': [t for t in range(6)] * 2,
                   'observations': [o for o in range(12)]})
The following code enables me to reformat the data as desired:
dflist = []
for v_ in df.visit.unique():
    for t_ in df.time[df.visit == v_]:
        dflist.append([
            df[(df.visit == v_) & (df.time <= t_)]
            .groupby('visit')['observations'].apply(list)
        ])
pd.DataFrame(pd.concat([df[0] for df in dflist], axis=0))
However, this is extremely slow.
I have tried using .expanding(), but it only returns scalars, whereas I would like each result to be a list (or NumPy array). A sketch of the kind of attempt I mean is below.
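For example, something along these lines (a rough sketch of what I tried, not my exact code) fails because the function passed to an expanding apply has to return a single number:
# Using the df defined above; this raises an error because the applied
# function must return a scalar, not a list.
df.groupby('visit')['observations'].expanding().apply(lambda s: list(s))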
I would appreciate any help in vectorizing or otherwise optimizing this procedure.
Thanks
Upvotes: 1
Views: 228
Reputation: 35636
Fortunately, in pandas 1.1.0 and newer, expanding produces an iterable, which lets you take advantage of the faster grouping while still producing non-scalar data like lists:
new_df = pd.DataFrame({
    'observations':
        [list(x) for x in df.groupby('visit')['observations'].expanding()]
}, index=df['visit'])
new_df:
observations
visit
1 [0]
1 [0, 1]
1 [0, 1, 2]
1 [0, 1, 2, 3]
1 [0, 1, 2, 3, 4]
1 [0, 1, 2, 3, 4, 5]
2 [6]
2 [6, 7]
2 [6, 7, 8]
2 [6, 7, 8, 9]
2 [6, 7, 8, 9, 10]
2 [6, 7, 8, 9, 10, 11]
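In case the mechanism is not obvious, here is a rough illustration (not part of the timings below) of what iterating the grouped expanding object yields: each element is the expanding window for its group as a Series, so list(x) collects the values seen so far within that visit.
# Rough illustration: each iteration yields the expanding window (a Series)
# restricted to its own group.
for x in df.groupby('visit')['observations'].expanding():
    print(list(x))
# [0]
# [0, 1]
# ...
# [6, 7, 8, 9, 10, 11]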
Timing via %timeit:
Setup:
import pandas as pd
df = pd.DataFrame({'visit': [1] * 6 + [2] * 6,
                   'time': [t for t in range(6)] * 2,
                   'observations': [o for o in range(12)]})
Original:
def fn():
    dflist = []
    for v_ in df.visit.unique():
        for t_ in df.time[df.visit == v_]:
            dflist.append([
                df[(df.visit == v_) & (df.time <= t_)]
                .groupby('visit')['observations'].apply(list)
            ])
    return pd.DataFrame(pd.concat([df[0] for df in dflist], axis=0))
%timeit fn()
13 ms ± 692 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
List comprehension with expanding (~13x faster on this sample):
def fn2():
    return pd.DataFrame({
        'observations':
            [list(x) for x in df.groupby('visit')['observations'].expanding()]
    }, index=df['visit'])
%timeit fn2()
967 µs ± 57.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sanity Check:
fn().eq(fn2()).all(axis=None) # True
The double-apply approach by @Quixotic22 (~3.4x faster than the original, ~3.9x slower than the list comprehension with expanding):
def fn3():
    return (df
            .set_index('visit')['observations']
            .apply(lambda x: [x])
            .reset_index()
            .groupby('visit')['observations']
            .apply(lambda x: x.cumsum()))
%timeit fn3()
3.78 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
*Note: this approach only produces a Series of observations; it does not include visit as the index.
fn().eq(fn3()).all(axis=None) # False
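If it is only the index that differs, one way to check that the list values themselves agree is to drop the indexes before comparing (a hedged sketch of my own, not from the original answers):
# Compare the lists row by row, ignoring the differing indexes.
fn().reset_index(drop=True)['observations'].eq(fn3().reset_index(drop=True)).all()  # expected: True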
Upvotes: 2
Reputation: 2924
Looks like a good solution has been provided, but I'm dropping this here as a viable alternative.
(df
 .set_index('visit')['observations']
 .apply(lambda x: [x])
 .reset_index()
 .groupby('visit')['observations']
 .apply(lambda x: x.cumsum())
)
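For what it's worth, a quick sketch of the output this produces on the sample df (a Series indexed by the original row positions rather than by visit, as noted in the answer above):
result = (df
          .set_index('visit')['observations']
          .apply(lambda x: [x])
          .reset_index()
          .groupby('visit')['observations']
          .apply(lambda x: x.cumsum()))
print(result.head(3))
# Roughly:
# 0          [0]
# 1       [0, 1]
# 2    [0, 1, 2]
# Name: observations, dtype: object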
Upvotes: 1