Sebi

Reputation: 9033

Efficiently split pandas dataframe and apply method to sub-sets

In Pandas, I want to:

As I increase items_per_chunk while keeping nr_of_chunks constant, the histogram of the chunk means should plot as an increasingly narrow bell curve.
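The expected narrowing can be checked numerically before plotting anything. The following is a minimal sketch on synthetic data, not the data from data.csv: the exponential distribution, the seed, and the `spread` dictionary are assumptions made only for illustration. For independent draws with standard deviation sigma, the chunk means should have a spread of roughly sigma / sqrt(items_per_chunk).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
nr_of_chunks = 1000
sigma = 1.0  # standard deviation of an exponential distribution with scale=1.0

spread = {}
for items_per_chunk in [1, 5, 20]:
    # Synthetic stand-in for the sampled column (exponential, so clearly non-normal).
    values = rng.exponential(scale=1.0, size=nr_of_chunks * items_per_chunk)
    # One row per chunk, then average across each row.
    chunk_means = values.reshape(nr_of_chunks, items_per_chunk).mean(axis=1)
    spread[items_per_chunk] = chunk_means.std()
    # Observed spread of the chunk means next to the expected sigma / sqrt(n).
    print(items_per_chunk, spread[items_per_chunk], sigma / np.sqrt(items_per_chunk))
```

The observed spread should shrink as items_per_chunk grows, which is what shows up as the narrowing histogram.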

I came up with the following pandas, NumPy, and seaborn approach, which seems inefficient and not very elegant to me:

%matplotlib inline

import pandas as pd
import seaborn as sns
import numpy as np
sns.set()

df = pd.read_csv('../data/data.csv')

nr_of_chunks = 1000

for items_per_chunk in [1, 5, 20]:
  sample = df.sample(nr_of_chunks * items_per_chunk)
  chunks = np.array_split(sample, nr_of_chunks)
  mean_of_chunks = [chunk.mean() for chunk in chunks]

  sns.distplot(mean_of_chunks)

Is there a better way to do it? For example, I expect there is a way to directly apply the mean function to each chunk while splitting the sample.

Upvotes: 0

Views: 61

Answers (1)

Stef

Reputation: 30589

After resetting the index of sample to a regular RangeIndex, you can simply group by the index floor-divided by items_per_chunk:

import pandas as pd
import seaborn as sns
sns.set()

df = pd.read_csv('../data/data.csv')

nr_of_chunks = 1000

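for items_per_chunk in [1, 5, 20]:
  # reset_index gives a 0..n-1 RangeIndex, so index // items_per_chunk labels each chunk
  sample = df.sample(nr_of_chunks * items_per_chunk).reset_index(drop=True)
  mean_of_chunks = sample.groupby(sample.index // items_per_chunk).mean()

  # distplot is deprecated since seaborn 0.11; sns.histplot(mean_of_chunks, kde=True) is the closest replacement
  sns.distplot(mean_of_chunks)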
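To see that the floor-division grouping produces the same chunk means as splitting the sample into consecutive blocks, here is a small self-contained check. It is only a sketch: the synthetic DataFrame, the column name `value`, and the fixed seeds are assumptions standing in for data.csv. It also shows a NumPy reshape as one possible alternative when every chunk has exactly the same size.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for data.csv with a single numeric column.
df = pd.DataFrame({"value": rng.normal(size=10_000)})

nr_of_chunks = 1000
items_per_chunk = 5

sample = df.sample(nr_of_chunks * items_per_chunk, random_state=0).reset_index(drop=True)

# Group label for each row: rows 0-4 -> 0, rows 5-9 -> 1, and so on.
labels = sample.index // items_per_chunk
grouped_means = sample.groupby(labels)["value"].mean()

# Reference: slice consecutive blocks by position and average each one.
reference = [
    sample["value"].iloc[i:i + items_per_chunk].mean()
    for i in range(0, len(sample), items_per_chunk)
]

# NumPy-only alternative: one row per chunk, then average across each row.
reshaped_means = (
    sample["value"].to_numpy().reshape(nr_of_chunks, items_per_chunk).mean(axis=1)
)

print(len(grouped_means))
print(np.allclose(grouped_means.to_numpy(), reference))
print(np.allclose(grouped_means.to_numpy(), reshaped_means))
```

The reshape variant only works because the sample size is an exact multiple of items_per_chunk; the groupby version also handles a shorter last chunk.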

Upvotes: 1
