Efficiently split pandas dataframe and apply method to sub-sets

Question

In Pandas, I want to:

randomly select a sample from a dataframe (with a single column)
split this sample into nr_of_chunks chunks, each containing items_per_chunk items
compute the mean of each chunk
and plot the chunk means as a histogram

As I increase items_per_chunk while keeping nr_of_chunks constant, the histogram of the chunk means should look like a bell curve that gets narrower.

I came up with the following approach using pandas, NumPy, and seaborn, which seems inefficient and not very elegant to me:

%matplotlib inline

import pandas as pd
import seaborn as sns
import numpy as np
sns.set()

df = pd.read_csv('../data/data.csv')

nr_of_chunks = 1000

for items_per_chunk in [1, 5, 20]:
  sample = df.sample(nr_of_chunks * items_per_chunk)
  chunks = np.array_split(sample, nr_of_chunks)
  mean_of_chunks = [chunk.mean() for chunk in chunks]

  sns.distplot(mean_of_chunks)

Output: [plot not shown]

Is there a better way to do it? For example, I expect there is a way to directly apply the mean function to each chunk while splitting the sample.

Stef · Accepted Answer

After resetting the index of sample to a regular RangeIndex, you can simply group by the index floor-divided by items_per_chunk, which puts each consecutive run of items_per_chunk rows into the same group:

import pandas as pd
import seaborn as sns
sns.set()

df = pd.read_csv('../data/data.csv')

nr_of_chunks = 1000

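for items_per_chunk in [1, 5, 20]:
  # Draw the sample and replace its shuffled index with 0, 1, 2, ...
  sample = df.sample(nr_of_chunks * items_per_chunk).reset_index(drop=True)
  # Rows 0..items_per_chunk-1 fall into group 0, the next items_per_chunk rows into group 1, ...
  mean_of_chunks = sample.groupby(sample.index // items_per_chunk).mean()

  # distplot is deprecated since seaborn 0.11; histplot and displot are its replacements
  sns.distplot(mean_of_chunks)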
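
To check the grouping without the CSV or the plot, here is a minimal sketch on synthetic data. The column name value, the exponential test data, and the helper names chunk_means_groupby and chunk_means_reshape are assumptions made for illustration, not part of the question. It also compares the groupby result with a plain NumPy reshape, which should give the same chunk means when every chunk has exactly items_per_chunk rows.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the single-column CSV (hypothetical data).
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.exponential(scale=1.0, size=100_000)})

nr_of_chunks = 1000


def chunk_means_groupby(values: pd.Series, items_per_chunk: int) -> np.ndarray:
    """Chunk means via groupby on the floor-divided positional index."""
    values = values.reset_index(drop=True)
    return values.groupby(values.index // items_per_chunk).mean().to_numpy()


def chunk_means_reshape(values: pd.Series, items_per_chunk: int) -> np.ndarray:
    """Same means via reshape: one row per chunk, then a row-wise mean."""
    return values.to_numpy().reshape(-1, items_per_chunk).mean(axis=1)


spread = {}
for items_per_chunk in [1, 5, 20]:
    sample = df["value"].sample(nr_of_chunks * items_per_chunk, random_state=0)
    by_groupby = chunk_means_groupby(sample, items_per_chunk)
    by_reshape = chunk_means_reshape(sample, items_per_chunk)

    # Both approaches should agree up to floating-point rounding.
    assert np.allclose(by_groupby, by_reshape)

    # Standard deviation of the chunk means; it should shrink as chunks grow.
    spread[items_per_chunk] = by_groupby.std()
    print(items_per_chunk, len(by_groupby), spread[items_per_chunk])
```

The reshape variant only works when the sample length is an exact multiple of items_per_chunk, which holds here by construction. The groupby variant also handles a shorter final chunk.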
