Reputation: 14094
I want to create a sample column that will evenly select from vcount by sId and cId:
df = pd.DataFrame({'sId': {0: 's0', 1: 's0', 2: 's1', 3: 's1', 4: 's2', 5: 's2', 6: 's2', 7: 's2', 8: 's3', 9: 's3', 10: 's3', 11: 's3', 12: 's3'}, 'cId': {0: 'c0', 1: 'c1', 2: 'c2', 3: 'c3', 4: 'c4', 5: 'c5', 6: 'c6', 7: 'c7', 8: 'c8', 9: 'c9', 10: 'c10', 11: 'c11', 12: 'c12'}, 'vcount': {0: 322, 1: 168, 2: 1818, 3: 81, 4: 13114, 5: 5, 6: 3, 7: 2, 8: 1979, 9: 1561, 10: 1548, 11: 1009, 12: 11}})
sId cId vcount
0 s0 c0 322
1 s0 c1 168
2 s1 c2 1818
3 s1 c3 81
4 s2 c4 13114
5 s2 c5 5
6 s2 c6 3
7 s2 c7 2
8 s3 c8 1979
9 s3 c9 1561
10 s3 c10 1548
11 s3 c11 1009
12 s3 c12 11
Right now I need it to work for a total sample of 100; expected output:
sId cId vcount sample
0 s0 c0 322 50
1 s0 c1 168 50
2 s1 c2 1818 50
3 s1 c3 81 50
4 s2 c4 13114 90
5 s2 c5 5 5
6 s2 c6 3 3
7 s2 c7 2 2
8 s3 c8 1979 22
9 s3 c9 1561 22
10 s3 c10 1548 22
11 s3 c11 1009 23
12 s3 c12 11 11
As you can see, for sId s2 there are 4 cIds, so we would want 25 from each; however, only one has more than 25, so we have to select everything from the other cIds and take the remainder from c4. Similarly, s0 has 2 cIds, so we want 50 from each, and each of those cIds has more than 50 available. For s3 it doesn't matter which one gets the largest share; I just need the distribution to be as uniform as possible.
The goal is to select from every cId for each sId and divide the 100 as evenly as possible.
I couldn't figure this out and manually typed in the sample column; however, that isn't a reasonable solution when the list gets larger.
Upvotes: 0
Views: 386
Reputation: 35626
Try something like:

- Take near_split from this SO question: Split an integer into bins.
- groupby sId and apply get_sample, which:
  - Primes the sample column with the values from vcount.
  - Masks vcount where it is less than total_sample / rows in the group.
  - Sums vcount where the values are less than that minimum share.
  - Sets sample on the negation of the mask (where vcount is GTE the minimum share) to an even distribution of the remaining samples.

import pandas as pd
df = pd.DataFrame({'sId': {0: 's0', 1: 's0', 2: 's1', 3: 's1',
                           4: 's2', 5: 's2', 6: 's2', 7: 's2',
                           8: 's3', 9: 's3', 10: 's3', 11: 's3',
                           12: 's3'},
                   'cId': {0: 'c0', 1: 'c1', 2: 'c2', 3: 'c3',
                           4: 'c4', 5: 'c5', 6: 'c6', 7: 'c7',
                           8: 'c8', 9: 'c9', 10: 'c10', 11: 'c11',
                           12: 'c12'},
                   'vcount': {0: 322, 1: 168, 2: 1818, 3: 81,
                              4: 13114, 5: 5, 6: 3, 7: 2, 8: 1979,
                              9: 1561, 10: 1548, 11: 1009,
                              12: 11}})
# Control Variables
total_sample = 100
def near_split(x, num_bins):
    if num_bins <= 0:
        return
    quotient, remainder = divmod(x, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)


def get_sample(g):
    # How many values in the group
    rows = len(g)
    # Prime sample with the values of vcount
    g['sample'] = g['vcount']
    # Get locations where vcount is less than the even share of samples
    lt_mask = g['vcount'] < (total_sample / rows)
    # Get Series of vcount that match lt_mask
    lt_s = g.loc[lt_mask, 'vcount']
    # Sum lt_s and subtract from total_sample to get the remaining samples,
    # then distribute them evenly among the GTE rows
    g.loc[~lt_mask, 'sample'] = \
        near_split(total_sample - lt_s.sum(), rows - len(lt_s))
    return g


new_df = df.groupby('sId').apply(get_sample)
# For Display
print(new_df)
Output:
sId cId vcount sample
0 s0 c0 322 50
1 s0 c1 168 50
2 s1 c2 1818 50
3 s1 c3 81 50
4 s2 c4 13114 90
5 s2 c5 5 5
6 s2 c6 3 3
7 s2 c7 2 2
8 s3 c8 1979 23
9 s3 c9 1561 22
10 s3 c10 1548 22
11 s3 c11 1009 22
12 s3 c12 11 11
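As a quick sanity check (not part of the original answer), near_split on its own reproduces the per-group splits seen above: an even 4-way split for a hypothetical group of 4 GTE rows, the s3 remainder of 89 (100 minus c12's 11) over 4 rows, and the s2 remainder of 90 (100 minus 5 + 3 + 2) over 1 row:

```python
def near_split(x, num_bins):
    # Split integer x into num_bins parts whose sizes differ by at most 1
    if num_bins <= 0:
        return
    quotient, remainder = divmod(x, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)

print(near_split(100, 4))  # [25, 25, 25, 25]
print(near_split(89, 4))   # [23, 22, 22, 22] -> s3's GTE rows
print(near_split(90, 1))   # [90]             -> s2's single GTE row
```

Because the remainder bins are listed first, the earliest GTE row in each group (c8 for s3) receives the extra unit, which is why the answer's output differs from the question's hand-typed sample only in which s3 row gets 23.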
Upvotes: 2