Reputation: 17154
How can we efficiently find the binned mean of a column in pandas dataframe?
I like to divide the column into 5 parts and find the mean of each part.
Here is what I did:
import numpy as np
import pandas as pd
df = pd.DataFrame({'x': np.arange(20)})
n_bins = 5
dfs = np.array_split(df,n_bins)
x_means = [x.mean()[0] for x in dfs]
n_elems = len(df) // n_bins
x_mean_lst = [[i]*n_elems for i in x_means]
x_mean_array = np.array(x_mean_lst).flatten()
df['x_bin_mean'] = x_mean_array
df
This seems more complicated than necessary. Are there any better alternatives?
The output should look like this:
x x_bin_mean
0 0 1.5
1 1 1.5
2 2 1.5
3 3 1.5
4 4 5.5
5 5 5.5
6 6 5.5
7 7 5.5
8 8 9.5
9 9 9.5
10 10 9.5
11 11 9.5
12 12 13.5
13 13 13.5
14 14 13.5
15 15 13.5
16 16 17.5
17 17 17.5
18 18 17.5
19 19 17.5
Upvotes: 2
Views: 109
Reputation: 402533
I'm guessing you want something like
df.groupby(df.index // (len(df) // n_bins))['x'].transform('mean')
or, if your index isn't numeric,
df.groupby(pd.RangeIndex(len(df)) // (len(df) // n_bins))['x'].transform('mean')
Here's what the grouper and output will look like for n_bins = 5
,
df.index // (len(df) // 5)
# Int64Index([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4], dtype='int64')
df['x_bin_mean'] = (
df.groupby(df.index // (len(df) // 5))['x'].transform('mean'))
df.head(10)
x x_bin_mean
0 0 1.5
1 1 1.5
2 2 1.5
3 3 1.5
4 4 5.5
5 5 5.5
6 6 5.5
7 7 5.5
8 8 9.5
9 9 9.5
Note that integer division, while fast, may not handle cases where the index does not divide equally:
I'm not sure that the integer division is fully correct (if things don't divide evenly). For instance with a length of 16 and n_bins=5 you get 6 groups —Alollz
In this case, use Alollz's helpful suggestion of pd.qcut
:
df.groupby(pd.qcut(df.index, n_bins))['x'].transform('mean')
Upvotes: 6