Dr.Chuck
Dr.Chuck

Reputation: 223

Select Random Data From Python Dataframe based on Columns data distribution & Conditions

I have 12 rows in my input data and need to select 4 random rows which keeps the columns distribution in focus at the time of random selection.

This is a sample data, original data contains million rows.

Input data Sample -

input_data = pd.DataFrame({'Id': ['A','A','A','A','A','A','B','B','B','C','C','C'],
           'Fruit': ['Apple','Mango','Orange','Apple','Apple','Mango','Apple','Mango','Apple','Apple','Apple','Orange'],
             'City':['California','California','Chicago','Michigan','New York','Ohio','Michigan',
                     'Michigan','Ohio','Florida','New York','Washington']})

Output Data Expectation -

output_data = pd.DataFrame({'Id': ['A','A','B','C'],
               'Fruit': ['Apple','Mango','Apple','Orange'],
                 'City':['California','Ohio','Michigan','New York']})

My random selection should consider the below three parameters -

  1. The Id distribution, in below image, out of 4, 2 rows should be selected from A, 1 row from B and one from C

enter image description here

  1. The Fruit distribution, 2 rows for Apple, 1 for Mango and 1 for Orange

enter image description here

  1. The data should prioritize the higher frequency Cities

enter image description here

I am aware of sampling the data using pandas sample function and tried that which gives me unbalanced selection -

input_data.sample(n = 4)

Any leads on how to attend the problem is really appreciated!

Upvotes: 1

Views: 565

Answers (2)

DanielTuzes
DanielTuzes

Reputation: 2754

You are prescribing probabilities on single random variables 3 times, once on the ID, once on fruit and once on the city, whereas you need to select an ordered tuple of 3: (ID, fruit, city), and you have restriction on the possible combinations too. In general, it is not possible. I'll explain why not so that you can modify your question to match your needs.

Forget about how pandas help you to make random choices and let's understand the problem mathematically first. Let's simplify the problem into 2D. Keep the fruits (apple, mango, orange) and cities (Ohio, Florida). First, let's suppose you have all the possible combination:

unique ID Fruit City
0 Apple Ohio
1 Apple Florida
2 Mango Ohio
3 Mango Florida
4 Orange Ohio
5 Orange Florida

Then you define the probability for the different categories independently via their frequency:

Fruit frequency probability
0 Apple 5 0.5
1 Mango 2 0.2
2 Orange 3 0.3
City frequency probability
0 Ohio 2 0.2
1 Florida 8 0.8

Now you can represent your possible choices:

all combination is allowed

Each line in your list of possible choices are represented in the figure (their ID is written into the center of the cells together with their possibilities). By selecting a line from the table, it means you generate a point on this 2D space. If you use the area of the cells to determine the probability you choose that pair, you'll get the desired 1D probability distributions, hence this representation. Of course, it is a good, intuitive choice to generate a random number on a 2D (discrete) space by generating 2 random numbers in each dimension, but this is not a must. In this example, the individual properties are independent, meaning that if your line's fruit property is apple, then it has a 20% or 80% probability, that it is from Ohio or Florida, respectively, which is equal to the original 1D distribution you prescribed for cities.

Now consider if you have an extra entry with unique id 6 for (Orange, Florida). When you generate a point on the 2D space, and it falls onto cells 5 and 6, you have the freedom to choose from the 5th or the 6th line. This case occurs if you have repeated set of tuples. (If your full table of all the 3 properties is considered, then you don't have repeated tuples).

Now consider what happens if you keep the prescribed 1D probabilities but don't represent all the possibilities, e.g. by removing the entry (Apple, Florida) with ID 1. You cannot generate points on the cell with number 1 anymore, but this affects the 1D probabilities you prescribed. If you can resolve this issue by redistributing the removed 40% so that individual category probabilities will be the one you desire, then you can select lines with the probability of the properties you want. This case occurs in your table, because not every possibility is listed.

If you can redistribute the probabilities, e.g. according to the following table (by scaleing up everything by 100%/(100%-40%), then not all the variables will be independent anymore. E.g. if it is apple, then it must have city Ohio (instead of 20% - 80% probability share with Florida).

a subset of the combinations is allowed

You mentioned that you have millions of rows. Maybe in your complete table all possible combinations can be found, and you don't need to deal with this problem. You can also extend your table so that it contains all the possible combinations, and you can later decide how to interpret the results when you selected a row not contained in your full table initially.

Upvotes: 2

Adam Kubalica
Adam Kubalica

Reputation: 128

this doesn't include the 'city' column, but it's a start:

# the usual:
import pandas as pd
import numpy as np
from random import sample
# our df:
df = pd.DataFrame({'Id': ['A','A','A','A','A','A','B','B','B','C','C','C'],
           'Fruit': ['Apple','Mango','Orange','Apple','Apple','Mango','Apple','Mango','Apple','Apple','Apple','Orange'],
             'City':['California','California','Chicago','Michigan','New York','Ohio','Michigan',
                     'Michigan','Ohio','Florida','New York','Washington']})
# the fun part:
def lets_check_it(df):
    # take 2 samples with 'A' Id, one sample with 'B', one sample with 'C' put them all in a df:
    result = pd.concat([df[df['Id']=='A'].sample(1),df[df['Id']=='A'].sample(1),df[df['Id']=='B'].sample(1),df[df['Id']=='C'].sample(1)])
    # if Apple or Orange are not in results, keep on sampling:
    while ('Apple' not in result['Fruit'].value_counts().index.tolist()) | ('Orange' not in result['Fruit'].value_counts().index.tolist()):  
        result = pd.concat([df[df['Id']=='A'].sample(1),df[df['Id']=='A'].sample(1),df[df['Id']=='B'].sample(1),df[df['Id']=='C'].sample(1)])   
    else:
    # if  Apple and Orange are in results, we have to check if it's 2 of 'Apple' and 1 of 'Orange that's the result we want
        while (result['Fruit'].value_counts()['Apple'] != 2) | (result['Fruit'].value_counts()['Orange'] != 1):
        # if it's not the desired result, run the whole function again:
            return lets_check_it(df) 
        # if it's the desired result, return the result:
        else:
            return result

Not sure how this is going to play out time-wise with millions of rows.

Upvotes: 0

Related Questions