Stats_kid

Reputation: 63

How do I randomly sample from a list in python while maintaining the distribution of data

Essentially what I'm trying to do is randomly select items from a list while maintaining the internal distribution. See the following example.

a = 17%
b = 12%
c = 4%
etc.

"a" has 1700 items in the list. "b" has 1200 items in the list. "c" has 400 items in the list.

Instead of using all information, I want a sample that mimics the distribution of a, b, c, etc.

So the goal would be to end up with,

170 randomly selected items from "a"
120 randomly selected items from "b"
40 randomly selected items from "c"

I know how to randomly select information from the list, but I haven't been able to figure out how to randomly select while forcing the outcome to have the same distribution.

Upvotes: 3

Views: 10132

Answers (6)

dave campbell

Reputation: 601

A pandas Series/DataFrame has a .sample() method that accepts a 'weights' argument.

If you have a DataFrame, those weights can be a column alongside the data.

Make your category totals that weight column, specify that column in your .sample() call, and you're done.

https://pandas.pydata.org/docs/reference/api/pandas.Series.sample.html
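For illustration, a minimal sketch of what that might look like with the counts from the question (the DataFrame, column names and sample size here are assumptions, not part of this answer):

import pandas as pd

# Hypothetical setup: one row per category, with the category total as the weight column.
df = pd.DataFrame({"category": ["a", "b", "c"], "total": [1700, 1200, 400]})

# Draw 330 labels with replacement, weighted by the totals:
# "a" is chosen with probability 1700/3300, "b" with 1200/3300, "c" with 400/3300.
picks = df.sample(n=330, replace=True, weights="total")
print(picks["category"].value_counts())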

Upvotes: 0

Régis B.

Reputation: 10598

It's pretty easy to do this manually. Let's store your data in a list of (value, probability) tuples:

data = [(a, 0.17), (b, 0.12), (c, 0.04), ...]

This is the function that will help you select random values that follow the probability distribution:

import random

def select_random_element(data):
    # Draw a uniform number in [0, 1] and walk through the cumulative
    # probabilities; return the value whose bucket contains the draw.
    sample_proba = random.uniform(0, 1)
    total_proba = 0
    for (value, proba) in data:
        total_proba += proba
        if total_proba >= sample_proba:
            return value

Finally, this is how we select N random items:

random_items = [select_random_element(data) for _ in range(0, N)]

This does not require any additional memory. However, the time complexity is O(len(data)*N). This can be improved by sorting the data list by decreasing probability beforehand:

data = sorted(data, key=lambda i: i[1], reverse=True)

Note that I assumed that the total probability of your data is 1. If not, you should write random.uniform(0, total_probability) instead of random.uniform(0, 1) in the above code, with:

total_probability = sum([i[1] for i in data])

Upvotes: 0

Eric Duminil

Reputation: 54223

If your lists aren't humongous and if memory isn't a problem, you could use this simple method.

To get n elements from a, b and c, you could concatenate the three lists together and pick random elements from the resulting list with random.choice:

import random

n = 50
a = ['a'] * 170
b = ['b'] * 120
c = ['c'] * 40
big_list = a + b + c
random_elements = [random.choice(big_list) for i in range(n)]
# ['a', 'c', 'a', 'a', 'a', 'b', 'a', 'c', 'b', 'a', 'c', 'a',
# 'a', 'a', 'a', 'b', 'b', 'a', 'a', 'a', 'a', 'a', 'c', 'a',
# 'c', 'a', 'b', 'a', 'a', 'c', 'a', 'b', 'a', 'c', 'b', 'a',
# 'a', 'b', 'a', 'b', 'a', 'a', 'c', 'a', 'c', 'a', 'b', 'c',
# 'b', 'b']

Each selected element has a len(a) / len(a + b + c) probability of coming from a.

You might get the same element multiple times though. If you don't want this to happen, you could use random.shuffle.
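For example, a small sketch building on the variables above (the copy of big_list is only there to keep the original list intact):

import random

pool = big_list[:]           # copy so the original list keeps its order
random.shuffle(pool)         # shuffle the copy in place
random_elements = pool[:n]   # the first n entries; no position is picked twice

random.sample(big_list, n) gives the same result in a single call.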

Upvotes: 5

Binyamin Even

Reputation: 3382

Just use shuffle on your list, and take the first n elements.
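A minimal sketch of that idea, assuming the per-category lists a, b and c from the question have already been built:

import random

items = a + b + c       # combine the category lists into one
random.shuffle(items)   # shuffle in place
first_n = items[:n]     # the first n elements form a random sample without replacement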

Upvotes: -1

MadPhysicist

Reputation: 5831

One way to "mimic" such a distribution in your selection would be to simply combine the lists into one and then select the total needed number of items from that list. If the total number of items that needs to be selected is large, then this approximation will be good.

Note that it does not guarantee that exactly those quantities from each list will be selected. However, if the lists are large and there are many runs of this routine, the average should be good.

import random

total = a + b + c  # + ... (concatenate any further lists)
samples = []
number = len(total) // 10  # sample 10% of the combined list
for i in range(number):
    samples.append(total[random.randint(0, len(total) - 1)])

Upvotes: 0

roganjosh

Reputation: 13175

From my understanding, you have three distinct populations and you want to sample from these populations randomly, but with a skewed probability of picking certain populations. In this case, it's easier to first generate a list of indices randomly that correspond to each population (as I combined them into a single 2D array called combined).

Then you can traverse the list of randomly generated indices, which gives you the population you're going to choose from, and then randomly pick from that data using np.random.choice().

import numpy as np

sample_a = np.arange(1, 1000)
sample_b = np.arange(1001, 2000)
sample_c = np.arange(2001, 3000)

combined = np.vstack((sample_a, sample_b, sample_c))

distributions = [0.7, 0.2, 0.1] # The skewed probability distribution for sampling

sample = np.random.choice([0, 1, 2], size=10, p=distributions) # Choose indices with skewed probability

combined_pool = []

for arr in sample:
    combined_pool.append(np.random.choice(combined[arr]))

Upvotes: 0
