lezebulon

Reputation: 7994

Fastest way to generate a column of random elements based on another column

I have a dataframe of ~20M rows.

I have a column called A that holds an id (there are ~10K ids in total). The value of this id defines the parameters of a random distribution. I want to generate a column B whose values are randomly drawn from the distribution defined by the value in column A.

What is the fastest way to do this? Doing something with iterrows or apply is extremely slow. Another possibility is to group by A and generate all the data for each value of A at once (so each draw uses only one distribution). But then I end up with a "groupBy" object instead of a DataFrame, and I don't know how to get back to the initial dataframe plus my new column.

Upvotes: 0

Views: 467

Answers (2)

BradMcDanel

Reputation: 553

I think this approach is similar to what you were describing, where you generate the samples for each id. On my machine, it appears this would take around 5 minutes to run. I assume you can trivially get the ids.

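import numpy as np

num_ids = 10000
num_rows = 20000000
ids = np.arange(num_ids)
# One location (mean) parameter per id
loc_params = np.random.random(num_ids)
# Column A: the id assigned to each row
A = np.random.randint(0, num_ids, num_rows)
B = np.zeros(A.shape)

for idx in ids:
    # Boolean mask selecting the rows that carry this id
    A_idxs = A == idx
    # Draw one sample per selected row: loc is the mean, size the number of draws
    B[A_idxs] = np.random.normal(loc=loc_params[idx], size=np.sum(A_idxs))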
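If the data already lives in a DataFrame, the same per-id loop can probably be written against `groupby(...).indices`, which returns the integer row positions of each group. That avoids scanning all of A once per id and gives a plain array to assign back as column B. This is only a sketch: the sizes are scaled down, and `loc_params` (one normal mean per id) is a stand-in for whatever parameters the ids really define.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Small stand-in sizes (assumption); the question has ~10K ids and ~20M rows.
num_ids = 100
num_rows = 10_000

# Hypothetical per-id parameter: the mean of a normal distribution.
loc_params = rng.random(num_ids)

df = pd.DataFrame({"A": rng.integers(0, num_ids, num_rows)})

# Preallocate the output, then fill it one group at a time.
B = np.empty(len(df))
for idx, positions in df.groupby("A").indices.items():
    # positions holds the integer row positions belonging to this id
    B[positions] = rng.normal(loc=loc_params[idx], size=len(positions))

# Back to the original DataFrame, now with the new column.
df["B"] = B
```

The result is the initial dataframe plus column B, so no "groupBy" object is left over at the end.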

Upvotes: 2

knowa42

Reputation: 404

This question is pretty vague, but how would this work for you?

df['B'] = df.apply(lambda row: distribution(row.A), axis=1)

Edit, following the update to the question (apply is too slow):

You could create a dictionary mapping each of the 10k ids to its generated value, then do something like:

df['B'] = df['A'].map(dictionary)

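I'm unsure if this will be faster than apply, but it will require fewer calls to your random distribution generator. Note that every row sharing an id then receives the same value, so this only fits if one draw per id is acceptable.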
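If each row needs its own draw, one variation is to map the ids to distribution parameters rather than to finished values, and then draw all rows in a single vectorized call. The sketch below assumes a normal distribution and a hypothetical `params` dictionary of (mean, standard deviation) per id; neither comes from the question. It relies on `Generator.normal` accepting array-valued `loc` and `scale`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical lookup table: id -> (mean, standard deviation).
params = {0: (0.0, 1.0), 1: (10.0, 0.5), 2: (-5.0, 2.0)}
loc_by_id = {k: v[0] for k, v in params.items()}
scale_by_id = {k: v[1] for k, v in params.items()}

# Small stand-in for the real dataframe.
df = pd.DataFrame({"A": rng.integers(0, 3, 1_000)})

# Look up each row's parameters, then draw every row in a single call.
loc = df["A"].map(loc_by_id).to_numpy()
scale = df["A"].map(scale_by_id).to_numpy()
df["B"] = rng.normal(loc=loc, scale=scale)
```

With this layout the random generator is called once, and rows that share an id still get independent values.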

Upvotes: 2
