Reputation: 14094
I want to create a sample column that will evenly select from vcount by sId and cId:
df = pd.DataFrame({'sId': {0: 's0', 1: 's0', 2: 's1', 3: 's1', 4: 's2', 5: 's2', 6: 's2', 7: 's2', 8: 's3', 9: 's3', 10: 's3', 11: 's3', 12: 's3'}, 'cId': {0: 'c0', 1: 'c1', 2: 'c2', 3: 'c3', 4: 'c4', 5: 'c5', 6: 'c6', 7: 'c7', 8: 'c8', 9: 'c9', 10: 'c10', 11: 'c11', 12: 'c12'}, 'vcount': {0: 322, 1: 168, 2: 1818, 3: 81, 4: 13114, 5: 5, 6: 3, 7: 2, 8: 1979, 9: 1561, 10: 1548, 11: 1009, 12: 11}})
sId cId vcount
0 s0 c0 322
1 s0 c1 168
2 s1 c2 1818
3 s1 c3 81
4 s2 c4 13114
5 s2 c5 5
6 s2 c6 3
7 s2 c7 2
8 s3 c8 1979
9 s3 c9 1561
10 s3 c10 1548
11 s3 c11 1009
12 s3 c12 11
Right now I need it to work for a total sample of 100; expected output:
sId cId vcount sample
0 s0 c0 322 50
1 s0 c1 168 50
2 s1 c2 1818 50
3 s1 c3 81 50
4 s2 c4 13114 90
5 s2 c5 5 5
6 s2 c6 3 3
7 s2 c7 2 2
8 s3 c8 1979 22
9 s3 c9 1561 22
10 s3 c10 1548 22
11 s3 c11 1009 23
12 s3 c12 11 11
As you can see, for sId s2 there are 4 cIds, so we would want 25 from each; however, only one has more than 25, so we have to select everything from the other cIds and take the remainder from c4. Similarly, s0 has 2 cIds, so we want 50 from each, and each of those cIds has more than 50 available. For s3 it doesn't matter which one gets the largest share; I just need the distribution to be as uniform as possible.
The goal is to select from every cId for each sId and divide the 100 as evenly as possible.
I couldn't figure this out and manually typed in the sample column; however, that isn't a reasonable solution when the list gets larger.
Upvotes: 0
Views: 386
Reputation: 35626
Try something like:

- Take near_split from this SO question: Split an integer into bins.
- groupby sId and apply get_sample, which:
  - Primes the sample column with the values from vcount.
  - Masks vcount where it is less than total_sample / rows in the group.
  - Sums vcount where the values are less than that minimum share.
  - Sets sample on the negation of the mask (where vcount is GTE the minimum share) to an even distribution of the remaining samples.

import pandas as pd
df = pd.DataFrame({'sId': {0: 's0', 1: 's0', 2: 's1', 3: 's1',
                           4: 's2', 5: 's2', 6: 's2', 7: 's2',
                           8: 's3', 9: 's3', 10: 's3', 11: 's3',
                           12: 's3'},
                   'cId': {0: 'c0', 1: 'c1', 2: 'c2', 3: 'c3',
                           4: 'c4', 5: 'c5', 6: 'c6', 7: 'c7',
                           8: 'c8', 9: 'c9', 10: 'c10', 11: 'c11',
                           12: 'c12'},
                   'vcount': {0: 322, 1: 168, 2: 1818, 3: 81,
                              4: 13114, 5: 5, 6: 3, 7: 2, 8: 1979,
                              9: 1561, 10: 1548, 11: 1009,
                              12: 11}})
# Control Variables
total_sample = 100
def near_split(x, num_bins):
    if num_bins <= 0:
        return
    quotient, remainder = divmod(x, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)


def get_sample(g):
    # How many values in the group
    rows = len(g)
    # Prime sample with the values of vcount
    g['sample'] = g['vcount']
    # Get locations where vcount is less than the even share of samples
    lt_mask = g['vcount'] < (total_sample / rows)
    # Get Series of vcount that match lt_mask
    lt_s = g.loc[lt_mask, 'vcount']
    # Sum lt_s and subtract from total_sample to get the remaining samples,
    # then distribute them evenly among the GTE rows
    g.loc[~lt_mask, 'sample'] = \
        near_split(total_sample - lt_s.sum(), rows - len(lt_s))
    return g


new_df = df.groupby('sId').apply(get_sample)
# For Display
print(new_df)
Output:
sId cId vcount sample
0 s0 c0 322 50
1 s0 c1 168 50
2 s1 c2 1818 50
3 s1 c3 81 50
4 s2 c4 13114 90
5 s2 c5 5 5
6 s2 c6 3 3
7 s2 c7 2 2
8 s3 c8 1979 23
9 s3 c9 1561 22
10 s3 c10 1548 22
11 s3 c11 1009 22
12 s3 c12 11 11
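As a quick sanity check (not part of the original answer), near_split on its own reproduces the per-group splits seen above: an even 4-way split for a hypothetical group of 4 GTE rows, the s3 remainder of 89 (100 minus c12's 11) over 4 rows, and the s2 remainder of 90 (100 minus 5 + 3 + 2) over 1 row:

```python
def near_split(x, num_bins):
    # Split integer x into num_bins parts whose sizes differ by at most 1
    if num_bins <= 0:
        return
    quotient, remainder = divmod(x, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)

print(near_split(100, 4))  # [25, 25, 25, 25]
print(near_split(89, 4))   # [23, 22, 22, 22] -> s3's GTE rows
print(near_split(90, 1))   # [90]             -> s2's single GTE row
```

Because the remainder bins are listed first, the earliest GTE row in each group (c8 for s3) receives the extra unit, which is why the answer's output differs from the question's hand-typed sample only in which s3 row gets 23.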
Upvotes: 2