Reputation: 63
Essentially what I'm trying to do is randomly select items from a list while maintaining the internal distribution. See the following example.
a = 17%
b = 12%
c = 4%
etc.
"a" has 1700 items in the list. "b" has 1200 items in the list. "c" has 400 items in the list.
Instead of using all information, I want a sample that mimics the distribution of a, b, c, etc.
So the goal would be to end up with,
170 randomly selected items from "a"
120 randomly selected items from "b"
40 randomly selected items from "c"
I know how to randomly select information from the list, but I haven't been able to figure out how to randomly select while forcing the outcome to have the same distribution.
Upvotes: 3
Views: 10132
Reputation: 601
A pandas Series/DataFrame has a .sample() method that accepts a weights argument.
If you have a DataFrame, the weights can be a column adjacent to the data.
Make your category totals that weights column, specify that column in your .sample() call, and you're done.
https://pandas.pydata.org/docs/reference/api/pandas.Series.sample.html
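A minimal sketch of that idea (the column name, weights, and sample size below are my own placeholders, not from the question):

```python
import pandas as pd

# Hypothetical data: 1700 'a' rows, 1200 'b' rows, 400 'c' rows
df = pd.DataFrame({"category": ["a"] * 1700 + ["b"] * 1200 + ["c"] * 400})

# A plain 10% sample already mimics the distribution in expectation
plain = df.sample(frac=0.1, random_state=0)

# A weights column can skew the selection if you need something other
# than the natural proportions
df["weight"] = df["category"].map({"a": 0.17, "b": 0.12, "c": 0.04})
weighted = df.sample(n=330, weights="weight", random_state=0)
```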
Upvotes: 0
Reputation: 10598
It's pretty easy to do this manually. Let's store your data in a list of (value, probability) tuples:
data = [(a, 0.17), (b, 0.12), (c, 0.04), ...]
This is the function that will help you select random values that follow the probability distribution:
import random

def select_random_element(data):
    sample_proba = random.uniform(0, 1)
    total_proba = 0
    for (value, proba) in data:
        total_proba += proba
        if total_proba >= sample_proba:
            return value
Finally, this is how we select N random items:
random_items = [select_random_element(data) for _ in range(0, N)]
This does not require any additional memory. However, the time complexity is O(len(data) * N). This can be improved on average by sorting the data list by decreasing probability beforehand:
data = sorted(data, key=lambda i: i[1], reverse=True)
Note that I assumed that the total probability of your data is 1. If not, you should write random.uniform(0, total_probability) instead of random.uniform(0, 1) in the above code, with:
total_probability = sum([i[1] for i in data])
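A quick end-to-end sketch of this approach (the probabilities, placeholder values, and draw count below are my own assumptions):

```python
import random
from collections import Counter

# Placeholder data whose probabilities sum to 0.33, not 1
data = [("a", 0.17), ("b", 0.12), ("c", 0.04)]
total_probability = sum(p for _, p in data)

def select_random_element(data):
    # Draw in [0, total_probability] and walk the cumulative distribution
    sample_proba = random.uniform(0, total_probability)
    total_proba = 0
    for value, proba in data:
        total_proba += proba
        if total_proba >= sample_proba:
            return value

# Draw many samples; the counts should roughly follow 17 : 12 : 4
counts = Counter(select_random_element(data) for _ in range(10000))
```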
Upvotes: 0
Reputation: 54223
If your lists aren't humongous and if memory isn't a problem, you could use this simple method.
To get n elements from a, b and c, you could concatenate the three lists together and pick random elements from the resulting list with random.choice:
import random
n = 50
a = ['a'] * 170
b = ['b'] * 120
c = ['c'] * 40
big_list = a + b + c
random_elements = [random.choice(big_list) for i in range(n)]
# ['a', 'c', 'a', 'a', 'a', 'b', 'a', 'c', 'b', 'a', 'c', 'a',
# 'a', 'a', 'a', 'b', 'b', 'a', 'a', 'a', 'a', 'a', 'c', 'a',
# 'c', 'a', 'b', 'a', 'a', 'c', 'a', 'b', 'a', 'c', 'b', 'a',
# 'a', 'b', 'a', 'b', 'a', 'a', 'c', 'a', 'c', 'a', 'b', 'c',
# 'b', 'b']
For each element, the probability of getting an element from a is len(a) / len(a + b + c).
You might get the same element multiple times though. If you don't want this to happen, you could use random.shuffle.
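For the without-replacement case, a sketch using random.sample (my substitution; shuffling and slicing would achieve the same effect):

```python
import random

a = ['a'] * 170
b = ['b'] * 120
c = ['c'] * 40
big_list = a + b + c

# random.sample draws n distinct positions, so no element is picked twice
random_elements = random.sample(big_list, 50)
```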
Upvotes: 5
Reputation: 3382
Just use random.shuffle on your list and take the first n elements.
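A sketch of that suggestion, with placeholder data and n of my own:

```python
import random

items = ['a'] * 170 + ['b'] * 120 + ['c'] * 40
random.shuffle(items)  # shuffles the list in place
first_n = items[:50]   # a uniform random sample without replacement
```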
Upvotes: -1
Reputation: 5831
One way to "mimic" such a distribution in your selection would be to simply combine the lists into one and then select the total needed number of items from that list. If the total number of items that needs to be selected is large, then this approximation will be good.
Note that it does not guarantee that exactly those quantities from each list will be selected. However, if the lists are large and there are many runs of this routine, the average should be good.
import random

total = a + b + c  # + ... (concatenate all the lists)
samples = []
number = len(total) // 10  # take 10% of the combined list
for i in range(number):
    samples.append(total[random.randint(0, len(total) - 1)])
Upvotes: 0
Reputation: 13175
From my understanding, you have three distinct populations and you want to sample from these populations randomly, but with a skewed probability of picking certain populations. In this case, it's easier to first generate a list of indices randomly that correspond to each population (as I combined them into a single 2D array called combined).
Then you can traverse the list of randomly generated indices, which gives you the population you're going to choose from, and then randomly pick from that population using np.random.choice().
import numpy as np
sample_a = np.arange(1, 1000)
sample_b = np.arange(1001, 2000)
sample_c = np.arange(2001, 3000)
combined = np.vstack((sample_a, sample_b, sample_c))
distributions = [0.7, 0.2, 0.1] # The skewed probability distribution for sampling
sample = np.random.choice([0, 1, 2], size=10, p=distributions) # Choose indices with skewed probability
combined_pool = []
for arr in sample:
    combined_pool.append(np.random.choice(combined[arr]))
Upvotes: 0