Reputation: 63727
Consider a population with a skewed class distribution, as in
ErrorType Samples
1 XXXXXXXXXXXXXXX
2 XXXXXXXX
3 XX
4 XXX
5 XXXXXXXXXXXX
I would like to randomly sample 20 out of 40 without undersampling any of the classes with smaller participation. For example, in the above case, I would want to sample as follows:
ErrorType Samples
1 XXXXX|XXXXXXXXXX
2 XXXXX|XXX
3 XX***|
4 XXX**|
5 XXXXX|XXXXXXX
i.e. 5 each of Types 1, 2 and 5, 2 of Type 3, and 3 of Type 4 (the even share is 20/5 = 4 per class; Types 3 and 4 contribute everything they have, and their shortfall of 3 is spread over the three larger classes, giving those 5 each).
I ended up writing rather convoluted code, but I believe there must be an easier way using pandas methods or some sklearn function.
sample_size = 20  # Just for the example

# Determine the average participation per error type
avg_items = sample_size / len(df.ErrorType.unique())
value_counts = df.ErrorType.value_counts()

# Classes too small to supply the average take everything they have;
# spread their shortfall over the remaining classes
less_than_avg = value_counts[value_counts < avg_items]
offset = avg_items * len(less_than_avg) - sum(less_than_avg)
offset_per_item = offset / (len(value_counts) - len(less_than_avg))
adj_avg = int(avg_items + offset_per_item)

df = df.groupby(['ErrorType'],
                group_keys=False).apply(lambda g: g.sample(min(adj_avg, len(g))))
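For comparison, here is a hedged sketch (my addition, not part of the original post) of the same allocation as a standalone step: a hypothetical allocate helper that hands out one draw per still-open class each round until the budget is spent, so the smaller classes are never undersampled.

import numpy as np

def allocate(counts, n):
    # split n draws across classes as evenly as the class sizes allow
    counts = np.asarray(counts)
    take = np.zeros_like(counts)
    while take.sum() < n:
        # classes that still have unsampled rows
        open_classes = np.flatnonzero(take < counts)
        # one more draw per open class, capped at the remaining budget
        extra = min(len(open_classes), n - take.sum())
        take[open_classes[:extra]] += 1
    return take

allocate([15, 8, 2, 3, 12], 20)  # -> array([5, 5, 2, 3, 5])

The truncation open_classes[:extra] favours earlier classes when the leftover does not divide evenly; shuffling open_classes first would break such ties at random.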
Upvotes: 5
Views: 2803
Reputation: 21643
No magic numbers. Simply sample from the entire population, coded in an obvious way.
The first step is to replace each 'X' with the numeric code of the stratum in which it appears. Thus coded, the entire population is stored in one string, called entire_population.
>>> strata = {}
>>> with open('skewed.txt') as skewed:
...     _ = next(skewed)
...     for line in skewed:
...         error_type, samples = line.rstrip().split()
...         strata[error_type] = samples
...
>>> whole = []
>>> for code in strata:
...     strata[code] = strata[code].replace('X', code)
...     code, strata[code]
...     whole.append(strata[code])
...
('3', '33')
('2', '22222222')
('1', '111111111111111')
('5', '555555555555')
('4', '444')
>>> entire_population = ''.join(whole)
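As an aside (my sketch, not part of the original answer): if the counts are already at hand, you can skip the file and build the same string directly.

>>> counts = {'1': 15, '2': 8, '3': 2, '4': 3, '5': 12}
>>> entire_population = ''.join(code * n for code, n in counts.items())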
Given the constraint that sample_size must be 20, randomly sample from the entire population, without replacement, to form the complete sample.
>>> sample_size = 20
>>> import random
>>> sample = random.sample(entire_population, sample_size)  # without replacement
>>> sample
['2', '5', '1', '5', '1', '1', '1', '3', '5', '5', '5', '1', '5', '2', '5', '1', '2', '2', '2', '5']
Finally, characterise the sample as a sampling design by counting the representatives of each stratum in it.
>>> from collections import Counter
>>> Counter(sample)
Counter({'5': 8, '1': 6, '2': 5, '3': 1})
Upvotes: 0
Reputation: 30605
You can make use of a helper column that caps each row's contribution at the per-class quota and then draw with pd.Series.sample.
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ErrorType': [1, 2, 3, 4, 5],
                   'Samples': [np.arange(100), np.arange(10), np.arange(3),
                               np.arange(2), np.arange(100)]})
df['new'] = df['Samples'].str.len().where(df['Samples'].str.len() < 5, 5)
# this tells us how many samples can be extracted per row
# 0    5
# 1    5
# 2    3
# 3    2
# 4    5
# Name: new, dtype: int64
# Sampling based on the newly obtained column, i.e.
df.apply(lambda x: pd.Series(x['Samples']).sample(x['new']).tolist(), axis=1)
# 0    [52, 81, 43, 60, 46]
# 1    [8, 7, 0, 9, 1]
# 2    [2, 1, 0]
# 3    [1, 0]
# 4    [29, 24, 16, 15, 69]
# dtype: object
I wrote a function that returns the per-row sample sizes by raising a threshold until the requested total is covered, i.e.
def get_thres_arr(sample_size, sample_length):
    thresh = sample_length.min()
    size = np.array([thresh] * len(sample_length))
    sum_of_size = sum(size)
    while sum_of_size < sample_size:
        # if a row is longer than the threshold, let it contribute one more
        size = np.where(sample_length > thresh, thresh + 1, sample_length)
        sum_of_size = sum(size)
        # increment the threshold
        thresh += 1
    return size
df = pd.DataFrame({'ErrorType': [1, 2, 3, 4, 5, 1, 7, 9, 4, 5],
                   'Samples': [np.arange(100), np.arange(10), np.arange(3),
                               np.arange(2), np.arange(100), np.arange(100),
                               np.arange(10), np.arange(3), np.arange(2),
                               np.arange(100)]})
ndf = pd.DataFrame({'ErrorType': [1, 2, 3, 4, 5, 6],
                    'Samples': [np.arange(100), np.arange(10), np.arange(3),
                                np.arange(1), np.arange(2), np.arange(100)]})

get_thres_arr(20, ndf['Samples'].str.len())
# array([5, 5, 3, 1, 2, 5])
get_thres_arr(20, df['Samples'].str.len())
# array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
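Note that because the threshold moves in whole steps, the returned sizes can overshoot the target, as in the first example above (5+5+3+1+2+5 = 21). Here is a hedged sketch of a trim pass, using a hypothetical trim_to_sample_size helper that is my addition rather than part of the answer:

def trim_to_sample_size(size, sample_size):
    # drop the surplus one draw at a time, from the largest class first
    size = size.copy()
    while size.sum() > sample_size:
        size[np.argmax(size)] -= 1
    return size

trim_to_sample_size(np.array([5, 5, 3, 1, 2, 5]), 20)
# array([4, 5, 3, 1, 2, 5])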
Now that you have the sizes, you can sample:
df['new'] = get_thres_arr(20, df['Samples'].str.len())
df.apply(lambda x: pd.Series(x['Samples']).sample(x['new']).tolist(), axis=1)
# 0    [64, 89]
# 1    [4, 0]
# 2    [0, 1]
# 3    [1, 0]
# 4    [41, 80]
# 5    [25, 84]
# 6    [4, 0]
# 7    [2, 0]
# 8    [1, 0]
# 9    [34, 1]
# dtype: object
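As a small aside (my note, not from the answer), pd.Series.sample also accepts a random_state argument if you need the draw to be reproducible:

df.apply(lambda x: pd.Series(x['Samples']).sample(x['new'], random_state=0).tolist(), axis=1)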
Hope it helps.
Upvotes: 2
Reputation: 5355
Wow. Got nerd-sniped on this one. I've written a function that will do what you want in numpy, without any magic numbers. It's not pretty, but I couldn't waste all that time writing something and not post it as an answer. There are two outputs, n_for_each_label and random_idxs, which are the number of selections to make for each class and the randomly selected indices, respectively. I can't think why you would want n_for_each_label when you have random_idxs, though.
EDIT: As far as I'm aware there is no functionality to do this in scikit-learn; it's not a very common way to dice up your data for ML, so I doubt there is anything.
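A brief counterpoint sketch (mine, not from the original answer): scikit-learn does ship plain proportional stratified sampling via train_test_split's stratify argument, but that keeps the small classes proportionally small instead of boosting them, so it does not solve the question as posed. Assuming y holds the labels (defined just below):

from sklearn.model_selection import train_test_split

# proportional stratified draw of 20: class shares stay as in y
_, y_sampled = train_test_split(y, test_size=20, stratify=y)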
import numpy as np

# This is your input: the sample size and your labels
sample_size = 20

# in your case you'd just want y = df.ErrorType
y = np.hstack((np.ones(15), np.ones(8) * 2,
               np.ones(2) * 3, np.ones(3) * 4,
               np.ones(12) * 5))
y = y.astype(int)
# y = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
#      3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]

# Below is the function
unique_labels = np.unique(y)
bin_c = np.bincount(y)[unique_labels]

# one row per class, padded with -1; each row holds a shuffled list of
# the positions of that class's samples in y
label_mat = np.ones((bin_c.shape[0], bin_c.max()), dtype=int) * -1
for i in range(unique_labels.shape[0]):
    label_loc = np.where(y == unique_labels[i])[0]
    np.random.shuffle(label_loc)
    label_mat[i, :label_loc.shape[0]] = label_loc

# widen the selection column by column until at least sample_size
# entries are covered
random_size = 0
i = 1
while random_size < sample_size:
    i += 1
    random_size = np.sum(label_mat[:, :i] != -1)

if random_size == sample_size:
    random_idxs = label_mat[:, :i]
    n_for_each_label = np.sum(random_idxs != -1, axis=1)
    random_idxs = random_idxs[random_idxs != -1]
else:
    # overshot: drop the surplus from distinct rows of the last column
    # (replace=False so the same row cannot be dropped twice)
    random_idxs = label_mat[:, :i]
    last_idx = np.where(random_idxs[:, -1] != -1)[0]
    n_drop = random_size - sample_size
    drop_idx = np.random.choice(last_idx, n_drop, replace=False)
    random_idxs[drop_idx, -1] = -1
    n_for_each_label = np.sum(random_idxs != -1, axis=1)
    random_idxs = random_idxs[random_idxs != -1]
Output:
n_for_each_label = array([5, 5, 2, 3, 5])
This is the number to sample from each of your error types, or, if you want to skip to the end:
random_idxs = array([ 3, 11, 8, 13, 9, 22, 15, 17, 20, 18, 23, 24, 25, 26, 27, 36, 32, 38, 35, 33])
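To pull the actual rows rather than their positions (my addition, assuming y was built from df.ErrorType with the same row order):

sampled_rows = df.iloc[random_idxs]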
Upvotes: 1