Eran Moshe
Eran Moshe

Reputation: 3208

Python - Sample from 'x' lists with different distribution

In the following code, I've created a list of items and users. I've separated the items to 3 different lists of very popular, popular and regular items.

import numpy as np


N_USERS = 20000
N_ITEMS = 1000

items = range(0, N_ITEMS)
users = range(0, N_USERS)

vpop = int(len(items)*0.1)
pop = int(len(items)*0.3)

np.random.shuffle(items)
vpop_items = items[:vpop]
pop_items = items[vpop:pop]
reg_items = items [pop:]

I want to sample X samples from those lists with different distribution. For example:

list_of_items = sample(vpop_items, pop_items, reg_items, p = [0.5, 0.35, 0.15], X)

where X is the number of samples I want to make, and P is the list of distributions corresponding with the lists (vpop_items, pop_items, reg_items).

So in the end I will have X "items" in list_of_items.

Let's say X = 100. I want 100 samples in total, with 0.5 chance from vpop_items, 0.35 chance from pop_items, and 0.15 chance from reg_items. The sampling must be done without replacement, that is, no item can be selected more than once.

Upvotes: 1

Views: 770

Answers (2)

Here's a straightforward numpy solution that makes use of the fact that you only have three categories. This probably doesn't scale well for too many categories, as it just loops over the three choices.

First, generate uniform pseudorandoms to determine how many samples to take from which group. Then, use numpy.random.choice to perform the sampling:

import numpy as np

# data setup
N_ITEMS = 1000

items = list(range(0, N_ITEMS)) #python 3

vpop = int(len(items)*0.1)
pop = int(len(items)*0.3)

np.random.shuffle(items)
vpop_items = items[:vpop]
pop_items = items[vpop:pop]
reg_items = items[pop:]

# actual answer
def randsample(data1,data2,data3,probs,samples):
    # "samples" is the number of samples to take
    uniforms = np.random.rand(samples)
    inds1 = uniforms<=probs[0]
    inds2 = (probs[0]<uniforms) & (uniforms<=probs[0]+probs[1])
    inds3 = ~(inds1|inds2)

    output = np.empty(samples,dtype=type(data1[0])) #set dtype
    for ind,dat in zip((inds1,inds2,inds3),(data1,data2,data3)):
        output[ind] = np.random.choice(dat,ind.sum(),replace=False)

    #TODO: guard against depletion of one of the data sources...

    return output

res = randsample(vpop_items, pop_items, reg_items, [0.5, 0.35, 0.15], 100)

The array uniforms contains a pseudorandom uniform number between 0 and 1 for each sample point. We compare these numbers to the (cumulative) probabilities given in the input, in order to choose from the respective categories with the prescribed probability. Generally, for a given sample we choose from type i if the corresponding pseudorandom number is between sum(probs[:i]) and sum(probs[:i+1]). The three index arrays inds1,inds2,inds3 give a disjoint partitioning of our output samples, specifying the type of category for the given sample point. Then all we have to do is to set the corresponding indices of the output array based on a random choice from the given category.

Just to check that the resulting samples are correct and representative:

>>> np.in1d(res, vpop_items).sum()/res.size
0.53000000000000003
>>> np.in1d(res, pop_items).sum()/res.size
0.34000000000000002
>>> np.in1d(res, reg_items).sum()/res.size
0.13
>>> (np.in1d(res, reg_items) & np.in1d(res,pop_items)).sum()
0
>>> (np.in1d(res, reg_items) & np.in1d(res,vpop_items)).sum()
0
>>> (np.in1d(res, pop_items) & np.in1d(res,vpop_items)).sum()
0

Upvotes: 1

PM 2Ring
PM 2Ring

Reputation: 55499

Here's a plain Python algorithm that does what you want. It's more efficient than what you're currently doing, but I'm sure there's a smarter way to do it. :)

Let num be the total number of samples wanted. We first generate num random numbers in the range 0 - 1, and test them against the desired cumulative probabilities, keeping count of how many numbers occur in each probability range. Next, we sample each sequence, using the counts we found in the first step as the sample size. Finally, we shuffle those samples together.

In the code below I've commented out the lines that do the shuffling to make it easier to see what's going on while testing the code.

from random import seed, random, sample, shuffle
from itertools import accumulate

def multi_sample(seqs, probs, num):
    ''' Sample from each sequence in list/tuple `seqs` with the corresponding 
        probability in list/tuple `probs`. Return a list containing `num` samples
    '''
    # Compute the cumulative probability
    # This really should raise ValueError if aprobs[-1] != 1.0
    # and we ought to check that len(seqs) == len(probs)...
    aprobs = list(accumulate(probs))

    # Determine how many samples to take from each seq
    counts = [0] * len(seqs)
    for _ in range(num):
        x = random()
        for i, p in enumerate(aprobs):
            if x < p:
                break
        counts[i] += 1

    lst = []
    for seq, count in zip(seqs, counts):
        lst.extend(sample(seq, count))

    #shuffle(lst)
    return lst

# Test

N_ITEMS = 1000
items = list(range(N_ITEMS))
vpop = int(N_ITEMS * 0.1)
pop = int(N_ITEMS * 0.3)

#shuffle(items)
vpop_items = items[:vpop]
pop_items = items[vpop:pop]
reg_items = items[pop:]

all_items = (vpop_items, pop_items, reg_items)

list_of_items = multi_sample(all_items, probs=[0.5, 0.35, 0.15], num=100)
print(list_of_items)

# Verify

#list_of_items.sort()
#print(list_of_items)

# Should be ~50
print(sum(1 for x in list_of_items if x < vpop))
# Should be ~35
print(sum(1 for x in list_of_items if vpop <= x < pop))

typical output

[65, 16, 81, 97, 30, 33, 52, 92, 96, 72, 50, 4, 75, 7, 44, 18, 90, 9, 91, 56, 85, 28, 84, 88, 76, 21, 14, 77, 8, 59, 22, 34, 93, 95, 63, 10, 99, 41, 60, 36, 66, 2, 13, 64, 51, 43, 11, 106, 153, 235, 189, 132, 150, 226, 196, 247, 245, 194, 172, 227, 202, 256, 163, 205, 131, 192, 295, 147, 246, 108, 291, 155, 128, 171, 141, 124, 102, 210, 294, 284, 276, 148, 122, 290, 948, 566, 894, 884, 310, 476, 562, 313, 357, 846, 794, 317, 335, 599, 370, 988]
47
37

Be aware that this function can fail: if you call sample(seq, count) where count > len(seq) it will raise ValueError: Sample larger than population. So you need to ensure that num is small enough so that that can't occur. To be totally safe, make sure that num is <= than the length of the smallest sequence. With the given data, num is 100 and the smallest sequence is vpop_items, which contains 100 items, so we don't need to worry.

Thanks to Andras Deak for bringing this important point to my attention.


As I said earlier, there's bound to be a smarter way to do this: rather than calculating counts in a loop we should be able to just generate those counts directly using appropriate mathematics, but I'm afraid I don't know (or don't remember) how to do that. Of course, we could "cheat". :) Using the given data, we want approximately 50 items from vpop_items, 35 items from pop_items and the remaining 15 items from reg_items. So we could set counts to [50, 35, 15] and then make a small random adjustment to each count, taking care to keep the total equal to 100.

Upvotes: 2

Related Questions