Pandas: Adding a new column with random numbers to DF based on multiple criteria from row

Question

I am a beginner. I've looked all over and read a bunch of related questions but can't quite figure this out. I know I am the problem and that I'm missing something, but I'm hoping someone will be kind and help me out. I am attempting to convert data from one video game (a college basketball simulation) into data consistent with another video game's (pro basketball simulation) format.

I have a DF that has columns: Name, Pos, Height, Weight, Shot, Points

With values such as:

Jon Smith, C, 84, 235, Exc, 19.4
Greg Jones, PG, 72, 187, Fair, 12.0

I want to create a new column, "InsideScoring". What I'd like to do is assign each player a randomly generated number within a certain range, based on the position they played, their height, weight, shot rating and points scored.

I tried a bunch of attempts like:

df1['InsideScoring'] = 0
df1.loc[(df1.Pos == "C") &
        (df1.Height > 82) &
        (df1.Points > 19.0) &
        (df1.Weight > 229), 'InsideScoring'] = np.random.randint(85,100)

When I do this, all the players (row at column "InsideScoring") that meet the criteria get assigned the same value between 85 and 100 rather than a random mix of numbers between 85 and 100.

Eventually, what I want to do is go through the list of players and based on those four criteria, assign values from different ranges. Any ideas appreciated.

Pandas: Create a new column with random values based on conditional

Numpy "where" with multiple conditions

user3483203 · Accepted Answer

My recommendation would be to use np.select here. You set up your conditions and your outputs, and you're good to go. To avoid iteration, but also to avoid assigning the same random value to every row that meets a condition, create an array of random values with the same length as your DataFrame for each range, and select from those:

Setup

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Chris', 'John'],
    'Height': [72, 84],
    'Pos': ['PG', 'C'],
    'Weight': [165, 235],
    'Shot': ['Amazing', 'Fair'],
    'Points': [999, 25]
})

    Name  Height Pos  Weight     Shot  Points
0  Chris      72  PG     165  Amazing     999
1   John      84   C     235     Fair      25

Now set up your ranges and your conditions (Create as many of these as you like):

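# One boolean mask per rule
cond1 = df.Pos.eq('C') & df.Height.gt(80) & df.Weight.gt(200)
cond2 = df.Pos.eq('PG') & df.Height.lt(80) & df.Weight.lt(200)

# One candidate value per row for each rule (the upper bound is exclusive)
range1 = np.random.randint(85, 100, len(df))
range2 = np.random.randint(50, 85, len(df))

# Each row takes its value from the first rule it matches; unmatched rows get 0
df.assign(InsideScoring=np.select([cond1, cond2], [range1, range2]))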

    Name  Height Pos  Weight     Shot  Points  InsideScoring
0  Chris      72  PG     165  Amazing     999             72
1   John      84   C     235     Fair      25             89
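If the number of rules grows, the conditions and ranges can be kept together in one list instead of separate cond1/range1 variables. The following is only a sketch of that idea: the `rules` list and its (condition, low, high) layout are my own convention, not part of pandas or NumPy, and the seed is there only to make the output repeatable. It uses `np.random.default_rng`, whose `integers` method also treats the upper bound as exclusive.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Chris', 'John'],
    'Height': [72, 84],
    'Pos': ['PG', 'C'],
    'Weight': [165, 235],
    'Shot': ['Amazing', 'Fair'],
    'Points': [999, 25]
})

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Hypothetical rule list: (boolean mask, low, high), with high exclusive
rules = [
    (df.Pos.eq('C') & df.Height.gt(80) & df.Weight.gt(200), 85, 100),
    (df.Pos.eq('PG') & df.Height.lt(80) & df.Weight.lt(200), 50, 85),
]

conditions = [cond for cond, _, _ in rules]
# One full-length array of candidates per rule, as in the answer above
choices = [rng.integers(low, high, len(df)) for _, low, high in rules]

# Rows matching no rule fall back to the default of 0
result = df.assign(InsideScoring=np.select(conditions, choices, default=0))
print(result)
```

Since np.select uses the first condition that matches a row, the order of the list matters when rules overlap: put the most specific rules first.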

Now to verify that rows matching the same condition don't all receive the same value:

df = pd.concat([df]*5)

... # Setup the ranges and conditions again

df.assign(InsideScoring=np.select([cond1, cond2], [range1, range2]))

    Name  Height Pos  Weight     Shot  Points  InsideScoring
0  Chris      72  PG     165  Amazing     999             56
1   John      84   C     235     Fair      25             96
0  Chris      72  PG     165  Amazing     999             74
1   John      84   C     235     Fair      25             93
0  Chris      72  PG     165  Amazing     999             63
1   John      84   C     235     Fair      25             97
0  Chris      72  PG     165  Amazing     999             55
1   John      84   C     235     Fair      25             95
0  Chris      72  PG     165  Amazing     999             60
1   John      84   C     235     Fair      25             90

And we can see that random values are assigned, even though they all match one of two conditions. While this is less memory efficient than iterating and picking a random value, since we are creating a lot of unused numbers, it will still be faster as these are vectorized operations.
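If the unused numbers are a concern, a different approach (not np.select) is to keep the .loc assignment from the question and draw exactly one value per matching row by passing `size=mask.sum()`. This is a sketch under my own assumptions: the third player row is made up so that two rows match, and the seed is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed, only for repeatability

df1 = pd.DataFrame({
    'Name': ['Jon Smith', 'Greg Jones', 'Al Big'],  # 'Al Big' is a made-up row
    'Pos': ['C', 'PG', 'C'],
    'Height': [84, 72, 85],
    'Weight': [235, 187, 250],
    'Shot': ['Exc', 'Fair', 'Good'],
    'Points': [19.4, 12.0, 21.3],
})

# Same criteria as in the question, written as one boolean mask
mask = (
    df1.Pos.eq('C')
    & df1.Height.gt(82)
    & df1.Points.gt(19.0)
    & df1.Weight.gt(229)
)

df1['InsideScoring'] = 0
# size=mask.sum() yields an array with one draw per matching row,
# instead of a single scalar that gets broadcast to all of them
df1.loc[mask, 'InsideScoring'] = rng.integers(85, 100, size=mask.sum())
print(df1)
```

This stays vectorized per rule, but it needs one assignment per rule, and when rules overlap a later assignment overwrites an earlier one, whereas np.select keeps the first match.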
