Reputation: 2333
I have the following dataframe:
MyAge Ages Names
7 [3,10,15] ['Tom','Jack','Sara']
6 [12,6,5,13] ['Nora','Betsy','John','Jill']
15 [24,3,65,15] ['Tala','Jane','Bill','Mark']
I want to generate a new column that contains, for each row, a randomly selected name from the list in Names, such that the age of the person with that name is less than or equal to MyAge. The Ages column holds the age of each person in the Names column, in the same order.
So one possible outcome is the following:
MyAge Ages Names RandomName RandomPersonAge
7 [3,10,15] ['Tom','Jack','Sara'] 'Tom' 3
6 [12,6,5,13] ['Nora','Betsy','John','Jill'] 'Betsy' 6
15 [24,3,65,15] ['Tala','Jane','Bill','Mark'] 'Jane' 3
Upvotes: 0
Views: 1135
Reputation: 109576
Given that the number of ages and names can differ from row to row, first create a random index for each row, based on the number of ages/names in that row, using a list comprehension. Then use two more list comprehensions to pick the name and age at that index. Finally, assign the results back to the original dataframe. Note that this first step does not yet take MyAge into account.
import numpy as np
import pandas as pd

# Sample data.
df = pd.DataFrame({
    "MyAge": [7, 6, 15],
    "Ages": [[3, 10, 15], [12, 6, 5, 13], [24, 3, 65, 15]],
    "Names": [['Tom', 'Jack', 'Sara'], ['Nora', 'Betsy', 'John', 'Jill'], ['Tala', 'Jane', 'Bill', 'Mark']]
})
# Solution.
np.random.seed(0)
random_index = [np.random.randint(len(ages)) for ages in df['Ages']]
names = [names[idx] for idx, names in zip(random_index, df['Names'])]
ages = [ages[idx] for idx, ages in zip(random_index, df['Ages'])]
>>> df.assign(RandomName=names, RandomPersonAge=ages)
MyAge Ages Names RandomName RandomPersonAge
0 7 [3, 10, 15] [Tom, Jack, Sara] Tom 3
1 6 [12, 6, 5, 13] [Nora, Betsy, John, Jill] Jill 13
2 15 [24, 3, 65, 15] [Tala, Jane, Bill, Mark] Jane 3
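The unconstrained draw above can pick someone older than MyAge (row 1 draws Jill, aged 13, against a MyAge of 6). As a minimal sketch, the check below flags such rows. It copies the drawn names and ages from the output above by hand rather than re-sampling them, and the column name MeetsConstraint is a hypothetical helper that is not part of the original answer.

```python
import pandas as pd

# MyAge values from the sample data; the drawn names and ages are copied
# by hand from the output shown above (they are not re-sampled here).
result = pd.DataFrame({
    "MyAge": [7, 6, 15],
    "RandomName": ["Tom", "Jill", "Jane"],
    "RandomPersonAge": [3, 13, 3],
})

# Hypothetical helper column: True where the drawn person is no older than MyAge.
result["MeetsConstraint"] = result["RandomPersonAge"] <= result["MyAge"]
print(result)
```

Row 1 fails the check, which is why the filtering step that follows is needed.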
To choose the random ages such that they are less than or equal to the value in MyAge, we should first flatten the data. We'll use a conditional, nested list comprehension to filter the data so that each output row contains the original row's index together with a name and its corresponding age, keeping only the pairs where the age is less than or equal to MyAge. We'll then create a dataframe from this filtered data and set its index to the first column, which holds the original dataframe's index labels. The rows of this dataframe are randomly shuffled via sample(frac=1). We then group on the index and take the first row of each group, which is a random eligible pick for that original row. Finally, we join the result back to the original dataframe (by default, the join is done on the index).
filtered_data = (
    [(idx, name, age)
     for idx, (my_age, ages, names) in df.iterrows()
     for age, name in zip(ages, names)
     if age <= my_age]
)
random_names_and_ages = (
    pd.DataFrame(filtered_data, columns=[df.index.name, 'RandomName', 'RandomPersonAge'])
    .set_index(df.index.name)
    .sample(frac=1)  # Randomly shuffle the rows in the dataframe.
    .groupby(level=0)[['RandomName', 'RandomPersonAge']]  # Group on the original index and take the first shuffled row.
    .first()
)
>>> df.join(random_names_and_ages)
MyAge Ages Names RandomName \
0 7 [3, 10, 15] [Tom, Jack, Sara] Tom
1 6 [12, 6, 5, 13] [Nora, Betsy, John, Jill] John
2 15 [24, 3, 65, 15] [Tala, Jane, Bill, Mark] Jane
RandomPersonAge
0 3
1 5
2 3
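For comparison, the same filter-then-pick idea can be sketched row by row, without flattening, shuffling and grouping. This is an alternative to the approach above, not the method it uses. It relies on the standard-library random module, the helper name pick_eligible is hypothetical, and returning (None, None) for a row with no eligible person is an assumption about the desired behaviour.

```python
import random

import pandas as pd

df = pd.DataFrame({
    "MyAge": [7, 6, 15],
    "Ages": [[3, 10, 15], [12, 6, 5, 13], [24, 3, 65, 15]],
    "Names": [["Tom", "Jack", "Sara"],
              ["Nora", "Betsy", "John", "Jill"],
              ["Tala", "Jane", "Bill", "Mark"]],
})


def pick_eligible(my_age, ages, names, rng):
    """Return a random (name, age) pair with age <= my_age, or (None, None)."""
    candidates = [(name, age) for name, age in zip(names, ages) if age <= my_age]
    # Assumption: rows with no eligible person get (None, None).
    return rng.choice(candidates) if candidates else (None, None)


rng = random.Random(0)  # Seeded only so that repeated runs give the same picks.
picks = [
    pick_eligible(my_age, ages, names, rng)
    for my_age, ages, names in zip(df["MyAge"], df["Ages"], df["Names"])
]
result = df.assign(
    RandomName=[name for name, _ in picks],
    RandomPersonAge=[age for _, age in picks],
)
print(result)
```

In row 0 only Tom (3) is eligible, so he is always chosen. Rows 1 and 2 each have two eligible people, so the pick there depends on the seed.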
Upvotes: 2