Abhijit

Reputation: 63727

Stratified random sampling with Population Balancing

Consider a population with skewed class distribution as in

     ErrorType   Samples
        1          XXXXXXXXXXXXXXX
        2          XXXXXXXX
        3          XX
        4          XXX
        5          XXXXXXXXXXXX

I would like to randomly sample 20 out of the 40 without undersampling any of the classes with smaller participation. For example, in the above case I would want to sample as follows

     ErrorType   Samples
        1          XXXXX|XXXXXXXXXX
        2          XXXXX|XXX
        3          XX***|
        4          XXX**|
        5          XXXXX|XXXXXXX

i.e. 5 each of Types 1, 2 and 5, 2 of Type 3, and 3 of Type 4

  1. This guarantees the sample size is as close as possible to my target, i.e. 20 samples
  2. None of the classes is under-represented, especially classes 3 and 4.

I ended up writing convoluted code, but I believe there should be an easier way using pandas methods or some sklearn functions.

 sample_size = 20  # Just for the example
 # Determine the average participation per error type
 avg_items = sample_size / len(df.ErrorType.unique())
 value_counts = df.ErrorType.value_counts()
 less_than_avg = value_counts[value_counts < avg_items]
 # Redistribute the shortfall from the small classes across the rest
 offset = avg_items * len(less_than_avg) - sum(less_than_avg)
 offset_per_item = offset / (len(value_counts) - len(less_than_avg))
 adj_avg = int(avg_items + offset_per_item)
 df = df.groupby(['ErrorType'],
                 group_keys=False).apply(lambda g: g.sample(min(adj_avg, len(g))))
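
For reference, a quick sanity check of the result (a minimal sketch; after the reassignment above, df holds the drawn sample):

 # Inspect the per-class counts and the total size of the sample
 print(df.ErrorType.value_counts())
 print(len(df))  # should land close to sample_size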

Upvotes: 5

Views: 2803

Answers (3)

Bill Bell

Reputation: 21643

No magic numbers. Simply sample from the entire population, coded in an obvious way.

The first step is to replace each 'X' with the numeric code of the stratum in which it appears. Thus coded, the entire population is stored in one string, called entire_population.

>>> strata = {}
>>> with open('skewed.txt') as skewed:
...     _ = next(skewed)
...     for line in skewed:
...         error_type, samples = line.rstrip().split()
...         strata[error_type] = samples
... 
>>> whole = []
>>> for _ in strata:
...     strata[_] = strata[_].replace('X', _)
...     _, strata[_]
...     whole.append(strata[_])
...     
('3', '33')
('2', '22222222')
('1', '111111111111111')
('5', '555555555555')
('4', '444')
>>> entire_population = ''.join(whole)

Given the constraint that the sample_size must be 20, randomly sample from the entire population to form a complete sample.

>>> sample = []
>>> sample_size = 20
>>> from random import choice
>>> for s in range(sample_size):
...     sample.append(choice(entire_population))
...     
>>> sample
['2', '5', '1', '5', '1', '1', '1', '3', '5', '5', '5', '1', '5', '2', '5', '1', '2', '2', '2', '5']

Finally, characterise the sample as a sampling design by counting the representatives of each stratum in it.

>>> from collections import Counter
>>> Counter(sample)
Counter({'5': 8, '1': 6, '2': 5, '3': 1})
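
Note that random.choice draws with replacement, so the same underlying unit can be selected more than once. If the goal is 20 distinct units out of the 40, a without-replacement variant is a one-liner (a minimal sketch, reusing entire_population from above; srs is just a local alias to avoid clobbering the sample list):

>>> from random import sample as srs
>>> sample = srs(entire_population, sample_size)  # draws 20 distinct positions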

Upvotes: 0

Bharath M Shetty

Reputation: 30605

You can make use of a helper column that caps each row's sample count at the per-class quota, and then use pd.Series.sample, i.e.

Example :

import numpy as np
import pandas as pd

df = pd.DataFrame({'ErrorType':[1,2,3,4,5],
                   'Samples':[np.arange(100),np.arange(10),np.arange(3),np.arange(2),np.arange(100)]})

df['new'] = df['Samples'].str.len().where(df['Samples'].str.len() < 5, 5)
# This lets us know how many samples can be extracted per row:
#0    5
#1    5
#2    3
#3    2
#4    5
#Name: new, dtype: int64
# Sampling based on the newly obtained column, i.e.
df.apply(lambda x : pd.Series(x['Samples']).sample(x['new']).tolist(),1)

0    [52, 81, 43, 60, 46]
1         [8, 7, 0, 9, 1]
2               [2, 1, 0]
3                  [1, 0]
4    [29, 24, 16, 15, 69]
dtype: object

I wrote a function that returns the per-class sample sizes using a threshold, i.e.

def get_thres_arr(sample_size, sample_length):
    thresh = sample_length.min()
    size = np.array([thresh] * len(sample_length))
    sum_of_size = sum(size)
    while sum_of_size < sample_size:
        # If a class has more samples than the threshold, raise its quota by 1;
        # otherwise the quota is capped at the class's own length
        size = np.where(sample_length > thresh, thresh + 1, sample_length)
        sum_of_size = sum(size)
        # Increment the threshold
        thresh += 1
    return size

df = pd.DataFrame({'ErrorType':[1,2,3,4,5,1,7,9,4,5],
                   'Samples':[np.arange(100),np.arange(10),np.arange(3),np.arange(2),np.arange(100),np.arange(100),np.arange(10),np.arange(3),np.arange(2),np.arange(100)]})
ndf = pd.DataFrame({'ErrorType':[1,2,3,4,5,6],
                   'Samples':[np.arange(100),np.arange(10),np.arange(3),np.arange(1),np.arange(2),np.arange(100)]})


get_thres_arr(20,ndf['Samples'].str.len())
#array([5, 5, 3, 1, 2, 5])

get_thres_arr(20,df['Samples'].str.len())
#array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
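
Note that in the ndf example the allocation sums to 21, one more than the requested 20, because the loop raises the quota of every eligible class at once. If an exact total matters, the overshoot can be trimmed afterwards; a minimal sketch (trim_to_total is a hypothetical helper, not part of the function above, and assumes get_thres_arr and ndf from earlier):

def trim_to_total(size, sample_size):
    # Remove the overshoot one unit at a time, always taking it
    # from a class that currently holds the largest quota
    size = size.copy()
    while size.sum() > sample_size:
        largest = np.where(size == size.max())[0]
        size[np.random.choice(largest)] -= 1
    return size

trim_to_total(get_thres_arr(20, ndf['Samples'].str.len()), 20)
# one of the three 5s drops to 4, chosen at random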

Now that you have the sizes, you can use:

df['new'] = get_thres_arr(20,df['Samples'].str.len())
df.apply(lambda x : pd.Series(x['Samples']).sample(x['new']).tolist(),1)

0    [64, 89]
1      [4, 0]
2      [0, 1]
3      [1, 0]
4    [41, 80]
5    [25, 84]
6      [4, 0]
7      [2, 0]
8      [1, 0]
9     [34, 1]

Hope it helps.

Upvotes: 2

piman314

Reputation: 5355

Wow. Got nerd sniped on this one. I've written a function that will do what you want in numpy, without any magic numbers... it's not pretty, but I couldn't waste all that time writing something and not post it as an answer. There are two outputs, n_for_each_label and random_idxs, which are the number of selections to make for each class and the randomly selected data respectively. I can't think why you would want n_for_each_label when you have random_idxs, though.

EDIT: As far as I'm aware there is no functionality to do this in scikit-learn; it's not a very common way to dice up your data for ML, so I doubt there is anything.

# This is your input, sample size and your labels
sample_size = 20
# in your case you'd just want y = df.ErrorType
y = np.hstack((np.ones(15), np.ones(8)*2,
               np.ones(2)*3, np.ones(3)*4,
               np.ones(12)*5))
y = y.astype(int)
# y = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
#      3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]

# Below is the function
unique_labels = np.unique(y)
bin_c = np.bincount(y)[unique_labels]
label_mat = np.ones((bin_c.shape[0], bin_c.max()), dtype=int)*-1
for i in range(unique_labels.shape[0]):
    label_loc = np.where(y == unique_labels[i])[0]
    np.random.shuffle(label_loc)
    label_mat[i, :label_loc.shape[0]] = label_loc
random_size = 0
i = 1
while random_size < sample_size:
    i += 1
    random_size = np.sum(label_mat[:, :i] != -1)

if random_size == sample_size:
    random_idxs = label_mat[:, :i]
    n_for_each_label = np.sum(random_idxs != -1, axis=1)
    random_idxs = random_idxs[random_idxs != -1]
else:
    random_idxs = label_mat[:, :i]
    last_idx = np.where(random_idxs[:, -1] != -1)[0]
    n_drop = random_size - sample_size
    drop_idx = np.random.choice(last_idx, n_drop, replace=False)  # without replacement, so exactly n_drop rows lose a sample
    random_idxs[drop_idx, -1] = -1
    n_for_each_label = np.sum(random_idxs != -1, axis=1)
    random_idxs = random_idxs[random_idxs != -1]

Output:

n_for_each_label = array([5, 5, 2, 3, 5])

The number from each of your error types to sample, or if you want to skip to the end:

random_idxs = array([ 3, 11, 8, 13, 9, 22, 15, 17, 20, 18, 23, 24, 25, 26, 27, 36, 32, 38, 35, 33])
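
Since random_idxs indexes straight into the original data, the stratified sample itself is one line away (a hedged usage sketch, assuming y was built from df.ErrorType as in the comment at the top):

# Hypothetical usage: pull the stratified sample out of the original frame
balanced = df.iloc[random_idxs]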

Upvotes: 1
