Pandas dataframe sample based on condition and sample size

Question

Code:

import pandas as pd

df = pd.DataFrame({'data': list(range(100))})

I want to take a random sample of size 20 such that 80% of the elements are between 0 and 10 and 20% are between 50 and 70.

I want a method that works for an arbitrary number of conditions.

My idea, which works but is not clean: select everything between 0 and 10 and take 80% * 20 random rows, do the same for the other ranges, and concatenate. This does not scale well to a larger number of conditions. Is there a pandas built-in I can use instead?
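For reference, the manual approach described above might look like the following sketch. The `conditions` list and the name `manual` are illustrative and not part of the original post; `replace=True` is an assumption needed because only 11 rows fall between 0 and 10 while 16 are requested.

```python
import pandas as pd

df = pd.DataFrame({'data': list(range(100))})
size = 20

# Hypothetical spec: one (boolean mask, fraction) pair per condition.
conditions = [
    (df['data'].between(0, 10), 0.8),
    (df['data'].between(50, 70), 0.2),
]

# Sample each subset separately, then stack the pieces.
# replace=True is assumed here: the first subset has 11 rows but 16 are needed.
manual = pd.concat([
    df[mask].sample(n=round(frac * size), replace=True)
    for mask, frac in conditions
])
```

Each new condition adds one entry to `conditions`, but every subset is still filtered and sampled in its own pass.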

mozway · Accepted Answer

You can bin the values with pd.cut and use a dictionary that maps each bin to the proportion to sample from it.

      # ┌─0─┐┌─1─┐┌─2─┐┌─3───┐
bins = [-1, 10, 50, 70, float('inf')]
fraction = {0: 0.8, 2: 0.2} # group 0 is (-1, 10], group 2 is (50, 70]
size = 20

groups = pd.cut(df['data'], bins=bins, labels=range(len(bins)-1))

sampled = (df
  .groupby(groups)['data']
  .apply(lambda g: g.sample(n=int(fraction.get(g.name, 0)*size),
                            replace=True)
        )
  #.droplevel(0)
 )

NB. I used replace=True in sample here because it would otherwise be impossible to get 16 unique elements from the group 0-10, which only contains 11 values. You can change that for your real data if the groups are large enough. Also, add .droplevel(0) to remove the group id.
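As a variation on the note above, one could sample with replacement only when a group is too small for the requested count. This is a sketch, not part of the original answer: the helper name `sample_group` is hypothetical, `round` is used instead of `int` to avoid truncating products such as 0.29 * 100, and `observed=True` is passed explicitly because the default for categorical groupers differs between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'data': list(range(100))})

bins = [-1, 10, 50, 70, float('inf')]
fraction = {0: 0.8, 2: 0.2}
size = 20

groups = pd.cut(df['data'], bins=bins, labels=range(len(bins) - 1))

def sample_group(g):
    # Hypothetical helper: number of rows requested from this bin (0 if unlisted).
    n = round(fraction.get(g.name, 0) * size)
    # Fall back to sampling with replacement only if the bin has too few rows.
    return g.sample(n=n, replace=n > len(g))

sampled = (df
  .groupby(groups, observed=True)['data']
  .apply(sample_group)
  .droplevel(0)   # drop the group id, keep the original row index
)
```

With the example data, group 0 has 11 rows for 16 requested, so it is sampled with replacement, while group 2 has 20 rows for 4 requested and yields unique rows.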

Output:

data    
0     6      6
      8      8
      2      2
      3      3
      0      0
      0      0
      6      6
      0      0
      6      6
      3      3
      10    10
      3      3
      8      8
      8      8
      8      8
      2      2
2     54    54
      53    53
      62    62
      64    64
Name: data, dtype: int64
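To check that a sample like the one above matches the requested proportions, the sampled values can be binned again and counted. This is a minimal sketch that reuses the answer's `bins`, `fraction`, and `size`; the name `counts` is illustrative, and `observed=False` is passed explicitly as an assumption to keep behavior stable across pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'data': list(range(100))})

bins = [-1, 10, 50, 70, float('inf')]
fraction = {0: 0.8, 2: 0.2}
size = 20
labels = range(len(bins) - 1)

groups = pd.cut(df['data'], bins=bins, labels=labels)

sampled = (df
  .groupby(groups, observed=False)['data']
  .apply(lambda g: g.sample(n=int(fraction.get(g.name, 0) * size),
                            replace=True))
)

# Re-bin the sampled values and count how many rows landed in each group.
counts = pd.cut(sampled, bins=bins, labels=labels).value_counts().sort_index()

# Group 0 should hold 16 rows (80% of 20) and group 2 should hold 4 (20% of 20).
print(counts)
```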
