kabanus
kabanus

Reputation: 25895

Speed up grouping with `apply` on a single column

I want to group a table such that the two first columns remain as they were when grouped, the 3d is the grouping mean, and the 4th the grouping dispersion, defined in the code. This is how I currently do it:

x = pd.DataFrame(np.array(((1,1,1,1),(1,1,10,2),(2,2,2,2),(2,2,8,3))))

   0  1   2  3
0  1  1   1  1
1  1  1  10  2
2  2  2   2  2
3  2  2   8  3

g      = x.groupby(0)
res    = g.mean()
res[3] = g.apply(lambda x: ((x[2]+x[3]).max()-(x[2]-x[3]).min())*0.5)
res

     1    2    3
0               
1  1.0  5.5  6.0
2  2.0  5.0  5.5

I am looking to speed this up anyway possible. In particular if I could get rid of apply and use g only once that would be great.

For testing purposes, this runs on data sizes of:

  1. A few to 60 rows
  2. 1-5 groups (there could be a single group)
  3. 4 columns

Here is a mid-sized sample:

array([[  0.00000000e+000,   4.70221520e-003,   1.14943038e-003,
      3.44829114e-009],
   [  1.81557753e-011,   4.94065646e-324,   4.70221520e-003,
      1.14943038e-003],
   [  2.36416931e-008,   1.97231804e-011,   9.88131292e-324,
      8.43322640e-003],
   [  1.74911362e-003,   3.43575891e-009,   1.12130677e-010,
      1.48219694e-323],
   [  8.43322640e-003,   1.74911362e-003,   3.42014182e-009,
      1.11974506e-010],
   [  1.97626258e-323,   4.70221520e-003,   1.14943038e-003,
      3.48747627e-009],
   [  1.78945412e-011,   2.47032823e-323,   4.70221520e-003,
      1.14943038e-003],
   [  2.32498418e-008,   1.85476266e-010,   2.96439388e-323,
      4.70221520e-003],
   [  1.14943038e-003,   3.50053798e-009,   1.85476266e-011,
      3.45845952e-323],
   [  4.70221520e-003,   1.14943038e-003,   4.53241298e-008,
      3.00419304e-010],
   [  3.95252517e-323,   4.70221520e-003,   1.14943038e-003,
      3.55278482e-009],
   [  1.80251583e-011,   4.44659081e-323,   4.70221520e-003,
      1.14943038e-003],
   [  1.09587738e-008,   1.68496045e-011,   4.94065646e-323,
      4.70221520e-003],
   [  1.14943038e-003,   3.48747627e-009,   1.80251583e-011,
      5.43472210e-323],
   [  4.70221520e-003,   1.14943038e-003,   3.90545096e-008,
      2.63846519e-010],
   [  5.92878775e-323,   8.43322640e-003,   1.74911362e-003,
      3.15465136e-009],
   [  1.04009792e-010,   6.42285340e-323,   8.43322640e-003,
      1.74911362e-003],
   [  2.56120209e-010,   4.15414486e-011,   6.91691904e-323,
      8.43322640e-003],
   [  1.74911362e-003,   3.43575891e-009,   1.12286848e-010,
      7.41098469e-323],
   [  8.43322640e-003,   1.74911362e-003,   5.91887557e-009,
      1.45863583e-010],
   [  7.90505033e-323,   8.43322640e-003,   1.74911362e-003,
      3.34205639e-009],
   [  1.07133209e-010,   8.39911598e-323,   8.43322640e-003,
      1.74911362e-003],
   [  1.21188587e-009,   7.07453993e-011,   8.89318163e-323,
      8.43322640e-003],
   [  1.74911362e-003,   3.38890765e-009,   1.12130677e-010,
      9.38724727e-323],
   [  8.43322640e-003,   1.74911362e-003,   1.79596488e-009,
      8.38637515e-011]])

Upvotes: 2

Views: 35

Answers (1)

jezrael
jezrael

Reputation: 862611

You can use syntacting sugar - .groupby with Series:

res[3] = ((x[2] + x[3]).groupby(x[0]).max() - (x[2] - x[3]).groupby(x[0]).min())*.5
print (res)
     1    2    3
0               
1  1.0  5.5  6.0
2  2.0  5.0  5.5

I get this timigs with you array:

In [279]: %%timeit
     ...: res    = x.groupby(0).mean()
     ...: res[3] = ((x[2] + x[3]).groupby(x[0]).max() - (x[2] - x[3]).groupby(x[0]).min())*.5
     ...: 
4.26 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [280]: %%timeit
     ...: g      = x.groupby(0)
     ...: res    = g.mean()
     ...: res[3] = g.apply(lambda x: ((x[2]+x[3]).max()-(x[2]-x[3]).min())*0.5)
     ...: 
11 ms ± 76.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Also if turn off sorting by grouping column if possible:

In [283]: %%timeit
     ...: res    = x.groupby(0, sort=False).mean()
     ...: res[3] = ((x[2] + x[3]).groupby(x[0], sort=False).max() - (x[2] - x[3]).groupby(x[0], sort=False).min())*.5
     ...: 
4.1 ms ± 50.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 2

Related Questions