Reputation: 3208
In the following code, I've created a list of items and users. I've separated the items to 3 different lists of very popular, popular and regular items.
import numpy as np
N_USERS = 20000
N_ITEMS = 1000
items = range(0, N_ITEMS)
users = range(0, N_USERS)
vpop = int(len(items)*0.1)
pop = int(len(items)*0.3)
np.random.shuffle(items)
vpop_items = items[:vpop]
pop_items = items[vpop:pop]
reg_items = items [pop:]
I want to sample X
samples from those lists with different distribution. For example:
list_of_items = sample(vpop_items, pop_items, reg_items, p = [0.5, 0.35, 0.15], X)
where X
is the number of samples I want to make, and P
is the list of distributions corresponding with the lists (vpop_items
, pop_items
, reg_items
).
So in the end I will have X
"items" in list_of_items
.
Let's say X = 100
. I want 100 samples in total, with 0.5 chance from vpop_items
, 0.35 chance from pop_items
, and 0.15 chance from reg_items
. The sampling must be done without replacement, that is, no item can be selected more than once.
Upvotes: 1
Views: 770
Reputation: 35175
Here's a straightforward numpy solution that makes use of the fact that you only have three categories. This probably doesn't scale well for too many categories, as it just loops over the three choices.
First, generate uniform pseudorandoms to determine how many samples to take from which group. Then, use numpy.random.choice
to perform the sampling:
import numpy as np
# data setup
N_ITEMS = 1000
items = list(range(0, N_ITEMS)) #python 3
vpop = int(len(items)*0.1)
pop = int(len(items)*0.3)
np.random.shuffle(items)
vpop_items = items[:vpop]
pop_items = items[vpop:pop]
reg_items = items[pop:]
# actual answer
def randsample(data1,data2,data3,probs,samples):
# "samples" is the number of samples to take
uniforms = np.random.rand(samples)
inds1 = uniforms<=probs[0]
inds2 = (probs[0]<uniforms) & (uniforms<=probs[0]+probs[1])
inds3 = ~(inds1|inds2)
output = np.empty(samples,dtype=type(data1[0])) #set dtype
for ind,dat in zip((inds1,inds2,inds3),(data1,data2,data3)):
output[ind] = np.random.choice(dat,ind.sum(),replace=False)
#TODO: guard against depletion of one of the data sources...
return output
res = randsample(vpop_items, pop_items, reg_items, [0.5, 0.35, 0.15], 100)
The array uniforms
contains a pseudorandom uniform number between 0 and 1 for each sample point. We compare these numbers to the (cumulative) probabilities given in the input, in order to choose from the respective categories with the prescribed probability. Generally, for a given sample we choose from type i
if the corresponding pseudorandom number is between sum(probs[:i])
and sum(probs[:i+1])
. The three index arrays inds1,inds2,inds3
give a disjoint partitioning of our output samples, specifying the type of category for the given sample point. Then all we have to do is to set the corresponding indices of the output array based on a random choice from the given category.
Just to check that the resulting samples are correct and representative:
>>> np.in1d(res, vpop_items).sum()/res.size
0.53000000000000003
>>> np.in1d(res, pop_items).sum()/res.size
0.34000000000000002
>>> np.in1d(res, reg_items).sum()/res.size
0.13
>>> (np.in1d(res, reg_items) & np.in1d(res,pop_items)).sum()
0
>>> (np.in1d(res, reg_items) & np.in1d(res,vpop_items)).sum()
0
>>> (np.in1d(res, pop_items) & np.in1d(res,vpop_items)).sum()
0
Upvotes: 1
Reputation: 55499
Here's a plain Python algorithm that does what you want. It's more efficient than what you're currently doing, but I'm sure there's a smarter way to do it. :)
Let num
be the total number of samples wanted. We first generate num
random numbers in the range 0 - 1, and test them against the desired cumulative probabilities, keeping count of how many numbers occur in each probability range. Next, we sample each sequence, using the counts we found in the first step as the sample size. Finally, we shuffle those samples together.
In the code below I've commented out the lines that do the shuffling to make it easier to see what's going on while testing the code.
from random import seed, random, sample, shuffle
from itertools import accumulate
def multi_sample(seqs, probs, num):
''' Sample from each sequence in list/tuple `seqs` with the corresponding
probability in list/tuple `probs`. Return a list containing `num` samples
'''
# Compute the cumulative probability
# This really should raise ValueError if aprobs[-1] != 1.0
# and we ought to check that len(seqs) == len(probs)...
aprobs = list(accumulate(probs))
# Determine how many samples to take from each seq
counts = [0] * len(seqs)
for _ in range(num):
x = random()
for i, p in enumerate(aprobs):
if x < p:
break
counts[i] += 1
lst = []
for seq, count in zip(seqs, counts):
lst.extend(sample(seq, count))
#shuffle(lst)
return lst
# Test
N_ITEMS = 1000
items = list(range(N_ITEMS))
vpop = int(N_ITEMS * 0.1)
pop = int(N_ITEMS * 0.3)
#shuffle(items)
vpop_items = items[:vpop]
pop_items = items[vpop:pop]
reg_items = items[pop:]
all_items = (vpop_items, pop_items, reg_items)
list_of_items = multi_sample(all_items, probs=[0.5, 0.35, 0.15], num=100)
print(list_of_items)
# Verify
#list_of_items.sort()
#print(list_of_items)
# Should be ~50
print(sum(1 for x in list_of_items if x < vpop))
# Should be ~35
print(sum(1 for x in list_of_items if vpop <= x < pop))
typical output
[65, 16, 81, 97, 30, 33, 52, 92, 96, 72, 50, 4, 75, 7, 44, 18, 90, 9, 91, 56, 85, 28, 84, 88, 76, 21, 14, 77, 8, 59, 22, 34, 93, 95, 63, 10, 99, 41, 60, 36, 66, 2, 13, 64, 51, 43, 11, 106, 153, 235, 189, 132, 150, 226, 196, 247, 245, 194, 172, 227, 202, 256, 163, 205, 131, 192, 295, 147, 246, 108, 291, 155, 128, 171, 141, 124, 102, 210, 294, 284, 276, 148, 122, 290, 948, 566, 894, 884, 310, 476, 562, 313, 357, 846, 794, 317, 335, 599, 370, 988]
47
37
Be aware that this function can fail: if you call sample(seq, count)
where count > len(seq)
it will raise ValueError: Sample larger than population
. So you need to ensure that num
is small enough so that that can't occur. To be totally safe, make sure that num
is <= than the length of the smallest sequence. With the given data, num
is 100 and the smallest sequence is vpop_items
, which contains 100 items, so we don't need to worry.
Thanks to Andras Deak for bringing this important point to my attention.
As I said earlier, there's bound to be a smarter way to do this: rather than calculating counts
in a loop we should be able to just generate those counts directly using appropriate mathematics, but I'm afraid I don't know (or don't remember) how to do that. Of course, we could "cheat". :) Using the given data, we want approximately 50 items from vpop_items
, 35 items from pop_items
and the remaining 15 items from reg_items
. So we could set counts
to [50, 35, 15]
and then make a small random adjustment to each count, taking care to keep the total equal to 100.
Upvotes: 2