J_p

Reputation: 475

Create random.randint with condition in a group by?

I have a column called cars and want to create another column called persons using np.random.randint(). So far I have:

dat['persons']=np.random.randint(1,5,len(dat))

The new column should hold the number of persons who use each car. How can I add a condition so that, for example, rows in the suv category only get numbers from 4 to 9?

cars | persons
suv     4
sedan   2
truck   2         
suv     1      
suv     5

Upvotes: 2

Views: 957

Answers (4)

Dan

Reputation: 773

I had a similar problem. I'll describe what I did in general terms because the application may vary. For smaller frames it won't matter, so the methods in the other answers might work, but for larger frames like mine (hundreds of thousands to millions of rows) I would do this:

  1. Sort dat by 'cars'
  2. Get a unique list of cars
  3. Create a temporary list for the random numbers
  4. Loop over that list of cars, filling the temporary list with random numbers for each car and extending a new list with the temporary list
  5. Add the new list to the 'persons' column
  6. If order matters, keep the original index and re-sort by it
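
The steps above can be sketched roughly as follows. This is only one reading of the description, not the answerer's actual code; the sample frame, the `ranges` dictionary, and `default_range` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; ranges are (low, high) with high exclusive,
# matching np.random.randint.
dat = pd.DataFrame({'cars': ['suv', 'sedan', 'truck', 'suv', 'suv']})
ranges = {'suv': (4, 10)}   # assumed: SUVs get 4 to 9 inclusive
default_range = (1, 5)      # assumed: every other car gets 1 to 4

# 1. Sort by 'cars' so rows of the same category are contiguous.
dat_sorted = dat.sort_values('cars', kind='stable')

# 2. Unique cars with their row counts, in the same sorted order.
counts = dat_sorted.groupby('cars', sort=True).size()

# 3-4. Draw one batch of random numbers per car and extend a single list.
persons = []
for car, count in counts.items():
    low, high = ranges.get(car, default_range)
    temp = np.random.randint(low, high, count)
    persons.extend(temp)

# 5. Attach the list to the sorted frame.
dat_sorted['persons'] = persons

# 6. Restore the original row order via the index.
dat = dat_sorted.sort_index()
```

The loop runs once per distinct car rather than once per row, which is presumably where the saving on large frames comes from.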

Upvotes: -1

cs95

Reputation: 402844

Option 1
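So, you're generating random numbers between 1 and 5, whereas numbers in the SUV category should be between 4 and 9. One simple way is to generate a random number for every row and then add 4 to the ones belonging to the SUV category. Note that np.random.randint excludes the upper bound, so randint(1, 5) yields 1 to 4 and the shifted SUV values fall between 5 and 8; adjust the bounds or the offset if you need a different range.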

df = df.assign(persons=np.random.randint(1,5, len(df)))
df.loc[df.cars == 'suv', 'persons'] += 4

df

    cars  persons
0    suv        7
1  sedan        3
2  truck        1
3    suv        8
4    suv        8

Option 2
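Another alternative is np.where, which picks each row's value from one of two full-length random arrays. Assigning with df['persons'] (rather than df.persons) also works when the column does not exist yet:

df['persons'] = np.where(df.cars == 'suv',
                         np.random.randint(5, 9, len(df)),
                         np.random.randint(1, 5, len(df)))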
df

    cars  persons
0    suv        8
1  sedan        1
2  truck        2
3    suv        5
4    suv        6

Upvotes: 1

Martijn Pieters

Reputation: 1123550

You can create a boolean mask for your series, where matching rows are True and everything else is False. You can then use loc[] with that mask to assign to the matching rows only, generating just as many values as there are selected rows:

m = dat['cars'] == 'suv'
dat.loc[m, 'persons'] = np.random.randint(4, 9, m.sum())

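You could also use apply on the cars series to create the new column, generating a new random value in each call. This uses the standard-library random module, whose randint includes both endpoints, unlike np.random.randint:

import random

dat['persons'] = dat.cars.apply(
    lambda c: random.randint(4, 9) if c == 'suv' else random.randint(1, 5))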

But this has to make a separate function call for each row. Using a mask will be more efficient.
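
As a minimal sketch of the mask approach covering both groups in one go: the sample frame and the seed are assumptions for illustration, and it uses NumPy's newer `Generator.integers` rather than `np.random.randint`, because `endpoint=True` makes the upper bound inclusive and the "4 to 9" requirement explicit.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the question.
dat = pd.DataFrame({'cars': ['suv', 'sedan', 'truck', 'suv', 'suv']})
rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Boolean mask: True for SUV rows, False elsewhere.
m = dat['cars'] == 'suv'

# Fill every row with the default range, 1 to 4 (upper bound exclusive).
dat['persons'] = rng.integers(1, 5, size=len(dat))

# Overwrite only the SUV rows with 4 to 9, both ends included.
dat.loc[m, 'persons'] = rng.integers(4, 9, size=m.sum(), endpoint=True)
```

With plain `np.random.randint(4, 9, ...)` the value 9 is never drawn, so the equivalent call there would be `np.random.randint(4, 10, m.sum())`.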

Upvotes: 2

Jacob H

Reputation: 607

There may be a way to do this with something like a groupby that's more clever than I am, but my approach would be to build a function and apply it to your cars column. This is pretty flexible - it will be easy to build in more complicated logic if you want something different for each car:

def get_persons(car):
    if car == 'suv':
        return np.random.randint(4, 9)
    else:
        return np.random.randint(1, 5)
dat['persons'] = dat['cars'].apply(get_persons)

Or, in a slicker but less flexible way:

dat['persons'] = dat['cars'].apply(lambda car: np.random.randint(4, 9) if car == 'suv' else np.random.randint(1, 5))
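
If each car needs its own range, one possible variation is to look the bounds up per row and draw all the numbers in a single call. This is a sketch under assumptions: the `ranges` dictionary, the `default` range, and the sample frame are hypothetical, and it relies on `Generator.integers` accepting array-like `low` and `high`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the question.
dat = pd.DataFrame({'cars': ['suv', 'sedan', 'truck', 'suv', 'suv']})

# Hypothetical inclusive (low, high) ranges per car, plus a fallback.
ranges = {'suv': (4, 9), 'truck': (1, 3)}
default = (1, 4)

# Per-row bounds; the lookup is still one Python call per row,
# only the random draw itself is vectorized.
low = dat['cars'].map(lambda c: ranges.get(c, default)[0]).to_numpy()
high = dat['cars'].map(lambda c: ranges.get(c, default)[1]).to_numpy()

# One draw per row; endpoint=True makes the upper bounds inclusive.
rng = np.random.default_rng()
dat['persons'] = rng.integers(low, high, endpoint=True)
```

Adding a new category then only means adding an entry to the dictionary rather than another branch in the function.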

Upvotes: 0
