BKS

Reputation: 2333

how to randomly select from a list in a column of lists in a pandas dataframe

I have the following dataframe:

MyAge   Ages            Names
7       [3,10,15]       ['Tom','Jack','Sara']
6       [12,6,5,13]     ['Nora','Betsy','John','Jill']
15      [24,3,65,15]    ['Tala','Jane','Bill','Mark']

I want to generate a new column that produces a randomly selected name for each row from the list of Names, so that the age of the person with that randomly selected name is less than or equal to MyAge. The column Ages reflects the ages of each person in the Names column.

So one possible outcome is the following:

MyAge   Ages            Names                             RandomName   RandomPersonAge
7       [3,10,15]       ['Tom','Jack','Sara']             'Tom'        3
6       [12,6,5,13]     ['Nora','Betsy','John','Jill']    'Betsy'      6
15      [24,3,65,15]    ['Tala','Jane','Bill','Mark']     'Jane'       3

Upvotes: 0

Views: 1135

Answers (1)

Alexander

Reputation: 109576

Given that the number of ages and names can differ from row to row, first create a random index for each row based on its number of ages/names, using a list comprehension. Then use two more list comprehensions to pick the name and the age at that index. Finally, assign the results back to the original dataframe. Note that this first version does not yet enforce the MyAge constraint.

import numpy as np
import pandas as pd

# Sample data.
df = pd.DataFrame({
    "MyAge": [7, 6, 15],
    "Ages": [[3, 10, 15], [12, 6, 5, 13], [24, 3, 65, 15]],
    "Names": [['Tom', 'Jack', 'Sara'], ['Nora', 'Betsy', 'John', 'Jill'], ['Tala', 'Jane', 'Bill', 'Mark']]
})

# Solution.
np.random.seed(0)
random_index = [np.random.randint(len(ages)) for ages in df['Ages']]
names = [names[idx] for idx, names in zip(random_index, df['Names'])]
ages = [ages[idx] for idx, ages in zip(random_index, df['Ages'])]
>>> df.assign(RandomName=names, RandomPersonAge=ages)
   MyAge             Ages                      Names RandomName  RandomPersonAge
0      7      [3, 10, 15]          [Tom, Jack, Sara]        Tom                3
1      6   [12, 6, 5, 13]  [Nora, Betsy, John, Jill]       Jill               13
2     15  [24, 3, 65, 15]   [Tala, Jane, Bill, Mark]       Jane                3

To restrict the random choice to people whose age is less than or equal to the value in MyAge, we should first flatten the data. We'll use a conditional, nested list comprehension to produce one (index, name, age) tuple for every person whose age is less than or equal to MyAge. We'll then create a dataframe from this filtered data and set its index to the first column, which holds the original dataframe's index values. The rows are randomly shuffled via sample(frac=1). We then group on the index and take the first row of each group, which is a random eligible person for that row. Finally, we join the result back to the original dataframe (the join is done on the index by default).

filtered_data = (
    [(idx, name, age) 
     for idx, (my_age, ages, names) in df.iterrows() 
     for age, name in zip(ages, names)
     if age <= my_age]
)
random_names_and_ages = (
    pd.DataFrame(filtered_data, columns=[df.index.name, 'RandomName', 'RandomPersonAge'])
    .set_index(df.index.name)
    .sample(frac=1)  # Randomly shuffle the rows in the dataframe.
    .groupby(level=0)[['RandomName', 'RandomPersonAge']]  # Group by the original index and take the first (random) row.
    .first()
)
>>> df.join(random_names_and_ages)
   MyAge             Ages                      Names RandomName  \
0      7      [3, 10, 15]          [Tom, Jack, Sara]        Tom   
1      6   [12, 6, 5, 13]  [Nora, Betsy, John, Jill]       John   
2     15  [24, 3, 65, 15]   [Tala, Jane, Bill, Mark]       Jane   

   RandomPersonAge  
0                3  
1                5  
2                3
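
As an alternative, the same filtering can be done one row at a time with apply. This is only a sketch, not a replacement for the approach above: the helper name pick_eligible and the use of np.random.default_rng are my own choices for illustration, and it assumes Ages and Names have the same length in every row. A row with no eligible person gets empty values here, which matches what the join above does (it is a left join, so such rows end up as NaN).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "MyAge": [7, 6, 15],
    "Ages": [[3, 10, 15], [12, 6, 5, 13], [24, 3, 65, 15]],
    "Names": [["Tom", "Jack", "Sara"],
              ["Nora", "Betsy", "John", "Jill"],
              ["Tala", "Jane", "Bill", "Mark"]],
})

rng = np.random.default_rng(0)  # Seeded only so the sketch is repeatable.


def pick_eligible(row):
    """Hypothetical helper: pick one (name, age) pair with age <= MyAge."""
    candidates = [
        (name, age)
        for name, age in zip(row["Names"], row["Ages"])
        if age <= row["MyAge"]
    ]
    if not candidates:
        # Nobody qualifies in this row, so leave both new columns empty.
        return pd.Series({"RandomName": None, "RandomPersonAge": np.nan})
    # Draw a random position among the eligible pairs only.
    name, age = candidates[rng.integers(len(candidates))]
    return pd.Series({"RandomName": name, "RandomPersonAge": age})


# apply(axis=1) returns one row of new columns per original row.
result = df.join(df.apply(pick_eligible, axis=1))
print(result[["MyAge", "RandomName", "RandomPersonAge"]])
```

This is likely slower than the flattened version on large dataframes because apply calls the helper once per row, but it keeps the name and the age paired explicitly and is easy to adapt.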

Upvotes: 2
