Dan

Reputation: 773

Using groupby to speed up random number generation in part of a dataframe

I have a program that uses a mask, similar to the accepted answer to the question linked below, to create multiple sets of random numbers in a dataframe, df.

Create random.randint with condition in a group by?

My code:

import numpy as np

# For each city, fill that city's rows with random integers in [1, 200)
for city in state:
    mask = df['City'] == city
    df.loc[mask, 'Random'] = np.random.randint(1, 200, mask.sum())

This gets slower as the dataframe df grows. Is there a way to speed it up with groupby?

Upvotes: 0

Views: 650

Answers (2)

Dan

Reputation: 773

I've figured out a much quicker way to do this. I'll keep the description general, since the details depend on what you want to achieve, and I'll leave Corralien's answer as the accepted one.

Instead of creating a mask or group and using .loc to update the dataframe in place, I sorted the dataframe by 'City' and then built a list of the unique values in the 'City' column.

Looping over the unique values (i.e., the groups), I generated the random numbers for each group and appended them to a single list with the list's .extend() method. I then assigned that list as the 'Random' column and sorted the dataframe back into its original order using the index.
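The steps described above might look roughly like the sketch below. It is a reconstruction under assumptions, not the exact code I ran: the toy dataframe, the df_sorted, counts and values names, and the use of value_counts() to get each group's size are all illustrative, and it assumes 'City' has no missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the 'City' column name comes from the question.
df = pd.DataFrame(
    {"City": ["Austin", "Boston", "Austin", "Denver", "Boston", "Austin"]}
)

# Sort by city so that each group's rows sit next to each other.
df_sorted = df.sort_values("City", kind="stable")

# Number of rows per city, looked up by city name inside the loop.
counts = df_sorted["City"].value_counts()

# unique() returns cities in order of appearance, which matches the sorted rows.
values = []
for city in df_sorted["City"].unique():
    values.extend(np.random.randint(1, 200, counts[city]))

# Assign the whole column at once, then restore the original row order.
df_sorted["Random"] = values
df = df_sorted.sort_index()
```

The speed-up presumably comes from assigning the column once instead of doing a masked .loc write per city; the per-group loop is only worth keeping if each group needs its own distribution parameters.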

Upvotes: 0

Corralien

Reputation: 120479

You can try:

df['Random'] = df.assign(Random=0).groupby(df['City'])['Random'] \
                 .transform(lambda x: np.random.randint(1, 200, len(x)))

Upvotes: 0
