Reputation: 1053
I have a dataframe like this:
   data
0   1.5
1   1.3
2   1.3
3   1.8
4   1.3
5   1.8
6   1.5
And I have a list of lists like this:
indices = [[0, 3, 4], [0, 3], [2, 6, 4], [1, 3, 4, 5]]
I want to produce the sum for each group of row positions in my list of lists, so
group1 = df[0] + df[3] + df[4]
group2 = df[0] + df[3]
group3 = df[2] + df[6] + df[4]
group4 = df[1] + df[3] + df[4] + df[5]
so I am looking for something like df.groupby(indices).sum().
I know this can be done iteratively with a for loop, applying the sum to each df.iloc[sublist] (as in the sketch below),
but I am looking for a faster way.
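For reference, a minimal sketch of the setup and the loop-based baseline I am using:

import pandas as pd

df = pd.DataFrame({'data': [1.5, 1.3, 1.3, 1.8, 1.3, 1.8, 1.5]})
indices = [[0, 3, 4], [0, 3], [2, 6, 4], [1, 3, 4, 5]]

# loop-based baseline: sum the 'data' values at each sublist of row positions
sums = [df.iloc[sublist]['data'].sum() for sublist in indices]
print(sums)  # roughly [4.6, 3.3, 4.1, 6.2]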
Upvotes: 1
Views: 461
Reputation: 862481
Use a list comprehension:
a = [df.loc[x, 'data'].sum() for x in indices]
print (a)
[4.6, 3.3, 4.1, 6.2]
Or work with the underlying NumPy array for better performance:
arr = df['data'].values
a = [arr[x].sum() for x in indices]
print (a)
[4.6, 3.3, 4.1, 6.2]
A solution with groupby + sum is also possible (repeat each group number once per index position, so the concatenated values can be grouped), but I am not sure the performance is better:
df1 = pd.DataFrame({
'd' : df['data'].values[np.concatenate(indices)],
'g' : np.arange(len(indices)).repeat([len(x) for x in indices])
})
print (df1)
      d  g
0   1.5  0
1   1.8  0
2   1.3  0
3   1.5  1
4   1.8  1
5   1.3  2
6   1.5  2
7   1.3  2
8   1.3  3
9   1.8  3
10  1.3  3
11  1.8  3
print(df1.groupby('g')['d'].sum())
g
0 4.6
1 3.3
2 4.1
3 6.2
Name: d, dtype: float64
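For comparison, the Python-level loop and the intermediate DataFrame can both be avoided with np.add.reduceat; a minimal sketch, assuming the same df and indices as above:

import numpy as np

arr = df['data'].to_numpy()
lengths = [len(x) for x in indices]
# start offset of each group within the concatenated index array
starts = np.concatenate(([0], np.cumsum(lengths)[:-1]))
# reduceat sums each slice flat[starts[i]:starts[i+1]] (last slice runs to the end)
a = np.add.reduceat(arr[np.concatenate(indices)], starts)
print(a)  # roughly [4.6, 3.3, 4.1, 6.2]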
Performance tested on the small sample data; on real data the results should be different:
In [150]: %timeit [df.loc[x, 'data'].sum() for x in indices]
4.84 ms ± 80.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [151]: %%timeit
...: arr = df['data'].values
...: [arr[x].sum() for x in indices]
...:
20.9 µs ± 99.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [152]: %timeit pd.DataFrame({'d' : df['data'].values[np.concatenate(indices)],'g' : np.arange(len(indices)).repeat([len(x) for x in indices])}).groupby('g')['d'].sum()
1.46 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
On the real data:
In [37]: %timeit [df.iloc[x, 0].sum() for x in indices]
158 ms ± 485 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [38]: arr = df['data'].values
...: %timeit \
...: [arr[x].sum() for x in indices]
5.99 ms ± 18 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In[49]: %timeit pd.DataFrame({'d' : df['last'].values[np.concatenate(sample_indices['train'])],'g' : np.arange(len(sample_indices['train'])).repeat([len(x) for x in sample_indices['train']])}).groupby('g')['d'].sum()
...:
5.97 ms ± 45.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Interesting, both of the bottom approaches are fast.
Upvotes: 1