Reputation: 25895
I want to group a table such that the first two columns remain as they were when grouped, the 3rd becomes the grouping mean, and the 4th the grouping dispersion, defined in the code. This is how I currently do it:
import numpy as np
import pandas as pd

x = pd.DataFrame(np.array(((1,1,1,1),(1,1,10,2),(2,2,2,2),(2,2,8,3))))
x
   0  1   2  3
0  1  1   1  1
1  1  1  10  2
2  2  2   2  2
3  2  2   8  3
g = x.groupby(0)
res = g.mean()
res[3] = g.apply(lambda x: ((x[2]+x[3]).max()-(x[2]-x[3]).min())*0.5)
res
     1    2    3
0
1  1.0  5.5  6.0
2  2.0  5.0  5.5
I am looking to speed this up any way possible. In particular, it would be great if I could get rid of apply and use g only once.
For testing purposes, here is a mid-sized sample of the data this runs on:
array([[ 0.00000000e+000, 4.70221520e-003, 1.14943038e-003,
3.44829114e-009],
[ 1.81557753e-011, 4.94065646e-324, 4.70221520e-003,
1.14943038e-003],
[ 2.36416931e-008, 1.97231804e-011, 9.88131292e-324,
8.43322640e-003],
[ 1.74911362e-003, 3.43575891e-009, 1.12130677e-010,
1.48219694e-323],
[ 8.43322640e-003, 1.74911362e-003, 3.42014182e-009,
1.11974506e-010],
[ 1.97626258e-323, 4.70221520e-003, 1.14943038e-003,
3.48747627e-009],
[ 1.78945412e-011, 2.47032823e-323, 4.70221520e-003,
1.14943038e-003],
[ 2.32498418e-008, 1.85476266e-010, 2.96439388e-323,
4.70221520e-003],
[ 1.14943038e-003, 3.50053798e-009, 1.85476266e-011,
3.45845952e-323],
[ 4.70221520e-003, 1.14943038e-003, 4.53241298e-008,
3.00419304e-010],
[ 3.95252517e-323, 4.70221520e-003, 1.14943038e-003,
3.55278482e-009],
[ 1.80251583e-011, 4.44659081e-323, 4.70221520e-003,
1.14943038e-003],
[ 1.09587738e-008, 1.68496045e-011, 4.94065646e-323,
4.70221520e-003],
[ 1.14943038e-003, 3.48747627e-009, 1.80251583e-011,
5.43472210e-323],
[ 4.70221520e-003, 1.14943038e-003, 3.90545096e-008,
2.63846519e-010],
[ 5.92878775e-323, 8.43322640e-003, 1.74911362e-003,
3.15465136e-009],
[ 1.04009792e-010, 6.42285340e-323, 8.43322640e-003,
1.74911362e-003],
[ 2.56120209e-010, 4.15414486e-011, 6.91691904e-323,
8.43322640e-003],
[ 1.74911362e-003, 3.43575891e-009, 1.12286848e-010,
7.41098469e-323],
[ 8.43322640e-003, 1.74911362e-003, 5.91887557e-009,
1.45863583e-010],
[ 7.90505033e-323, 8.43322640e-003, 1.74911362e-003,
3.34205639e-009],
[ 1.07133209e-010, 8.39911598e-323, 8.43322640e-003,
1.74911362e-003],
[ 1.21188587e-009, 7.07453993e-011, 8.89318163e-323,
8.43322640e-003],
[ 1.74911362e-003, 3.38890765e-009, 1.12130677e-010,
9.38724727e-323],
[ 8.43322640e-003, 1.74911362e-003, 1.79596488e-009,
8.38637515e-011]])
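For a self-contained frame of roughly this shape to time against, one can also generate stand-in data (hypothetical random values, not the sample above; the repeated integer keys in column 0 are only there so the groups are non-trivial):
import numpy as np
import pandas as pd

# stand-in for the real data: 25 rows x 4 columns, with a handful of
# repeated keys in column 0 so the groupby has something to aggregate
rng = np.random.default_rng(0)
keys = rng.integers(0, 5, size=25).astype(float)
vals = rng.random((25, 3))
x = pd.DataFrame(np.column_stack([keys, vals]))

g = x.groupby(0)
res = g.mean()
res[3] = g.apply(lambda d: ((d[2] + d[3]).max() - (d[2] - d[3]).min()) * 0.5)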
Upvotes: 2
Views: 35
Reputation: 862611
You can use syntactic sugar: call .groupby directly on a Series:
res[3] = ((x[2] + x[3]).groupby(x[0]).max() - (x[2] - x[3]).groupby(x[0]).min())*.5
print (res)
     1    2    3
0
1  1.0  5.5  6.0
2  2.0  5.0  5.5
I get these timings with your array:
In [279]: %%timeit
...: res = x.groupby(0).mean()
...: res[3] = ((x[2] + x[3]).groupby(x[0]).max() - (x[2] - x[3]).groupby(x[0]).min())*.5
...:
4.26 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [280]: %%timeit
...: g = x.groupby(0)
...: res = g.mean()
...: res[3] = g.apply(lambda x: ((x[2]+x[3]).max()-(x[2]-x[3]).min())*0.5)
...:
11 ms ± 76.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It also helps to turn off sorting by the grouping column, if possible:
In [283]: %%timeit
...: res = x.groupby(0, sort=False).mean()
...: res[3] = ((x[2] + x[3]).groupby(x[0], sort=False).max() - (x[2] - x[3]).groupby(x[0], sort=False).min())*.5
...:
4.1 ms ± 50.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
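If you really want to call groupby only once, one option is to precompute the sum and difference columns first and aggregate everything in a single pass. This is only a sketch (the helper column names s and d are arbitrary, and I have not timed it against the versions above), but it gives the same result:
# x is the DataFrame from the question
tmp = x.assign(s=x[2] + x[3], d=x[2] - x[3])
agg = tmp.groupby(0, sort=False).agg({1: 'mean', 2: 'mean', 's': 'max', 'd': 'min'})

res = agg[[1, 2]].copy()
res[3] = (agg['s'] - agg['d']) * 0.5
print (res)
Whether this beats the separate Series groupby calls above depends on the data, so it is worth timing on the real frames.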
Upvotes: 2