Reputation: 9033
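In Pandas, I want to draw a random sample of nr_of_chunks * items_per_chunk rows, split it into nr_of_chunks equally sized chunks, compute the mean of each chunk, and plot a histogram of those means.
If I increase items_per_chunk while keeping nr_of_chunks constant, the histogram of the chunk means should plot as a narrowing bell curve.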
I came up with the following Pandas, Numpy, Seaborn approach, which looks inefficient or not very clever to me:
%matplotlib inline
import pandas as pd
import seaborn as sns
import numpy as np
sns.set()
df = pd.read_csv('../data/data.csv')
nr_of_chunks = 1000
for items_per_chunk in [1, 5, 20]:
    sample = df.sample(nr_of_chunks * items_per_chunk)
    chunks = np.array_split(sample, nr_of_chunks)
    mean_of_chunks = [chunk.mean() for chunk in chunks]
    sns.distplot(mean_of_chunks)
Output: [image: histograms of the chunk means for items_per_chunk = 1, 5, 20]
Is there a better way to do it? For example, I expect there is a way to directly apply the mean function to each chunk while splitting the sample.
Upvotes: 0
Views: 61
Reputation: 30589
After resetting the index of sample to a regular RangeIndex, you can simply group by the index floor-divided by items_per_chunk:
import pandas as pd
import seaborn as sns
sns.set()
df = pd.read_csv('../data/data.csv')
nr_of_chunks = 1000
for items_per_chunk in [1, 5, 20]:
    sample = df.sample(nr_of_chunks * items_per_chunk).reset_index(drop=True)
    mean_of_chunks = sample.groupby(sample.index // items_per_chunk).mean()
    sns.distplot(mean_of_chunks)
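To see what the floor division does without needing the CSV, here is a minimal sketch on synthetic data. The column name value, the exponential distribution, and the fixed seeds are assumptions made only for illustration; they are not part of the original data. It skips the plotting and instead records the standard deviation of the chunk means, which should shrink as items_per_chunk grows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for '../data/data.csv': one skewed numeric column.
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.exponential(scale=1.0, size=100_000)})

nr_of_chunks = 1000
spread = {}  # items_per_chunk -> standard deviation of the chunk means

for items_per_chunk in [1, 5, 20]:
    sample = (
        df.sample(nr_of_chunks * items_per_chunk, random_state=0)
        .reset_index(drop=True)
    )
    # With a 0..n-1 index, rows 0..k-1 map to label 0, rows k..2k-1 to label 1, etc.
    chunk_id = sample.index // items_per_chunk
    mean_of_chunks = sample.groupby(chunk_id).mean()

    # Cross-check against a plain NumPy reshape into (chunks, items) blocks.
    reshaped = (
        sample["value"].to_numpy()
        .reshape(nr_of_chunks, items_per_chunk)
        .mean(axis=1)
    )
    matches_reshape = np.allclose(mean_of_chunks["value"].to_numpy(), reshaped)

    spread[items_per_chunk] = mean_of_chunks["value"].std()

print(spread)
```

The reset_index(drop=True) step matters here: df.sample keeps the original row labels, so without it the floor division would group by the original positions rather than by consecutive blocks of the shuffled sample.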
Upvotes: 1