Fastest way to generate a column with random elements based on another column

Question

I have a DataFrame of ~20M rows.

I have a column called A that gives me an id (there are ~10K ids in total). The value of this id determines the parameters of a random distribution. Now I want to generate a column B whose values are randomly drawn from the distribution defined by the value in column A.

What is the fastest way to do this? Using iterrows or apply is extremely slow. Another possibility is to group by A and generate all the data for each value of A at once (so I only draw from one distribution at a time). But then I end up with a "groupBy" object rather than a DataFrame, and I don't know how to get back to the initial DataFrame plus my new column.

BradMcDanel · Accepted Answer

I think this approach is similar to what you were describing, where you generate the samples for each id. On my machine, it appears this would take around 5 minutes to run. I assume you can trivially get the ids.

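import numpy as np

num_ids = 10000
num_rows = 20000000
ids = np.arange(num_ids)
# One location (mean) parameter per id
loc_params = np.random.random(num_ids)
# Column A: the id of each row
A = np.random.randint(0, num_ids, num_rows)
B = np.zeros(A.shape)

for idx in ids:
    # Boolean mask of the rows that belong to this id
    A_idxs = A == idx
    # Draw one sample per matching row, centered on this id's parameter
    B[A_idxs] = np.random.normal(loc_params[idx], size=np.sum(A_idxs))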
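If each id's distribution is fully described by a per-id parameter array, the loop may not be needed at all: NumPy's sampling functions accept array-valued parameters and draw one sample per element. Below is a minimal sketch, not a benchmark. It assumes a normal distribution with unit scale, and it assumes the ids are the integers 0 to num_ids - 1 so they can index loc_params directly. The sizes are reduced for illustration and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator; the seed is arbitrary

num_ids = 1_000      # smaller than the question's ~10K, for illustration
num_rows = 200_000   # smaller than the question's ~20M, for illustration

# One location (mean) parameter per id, as in the loop version
loc_params = rng.random(num_ids)
# Column A: the id of each row (assumed to be integers in [0, num_ids))
A = rng.integers(0, num_ids, size=num_rows)

# Look up every row's parameter in one step with integer (fancy) indexing
row_locs = loc_params[A]

# With an array-valued loc, one sample is drawn per element of row_locs
B = rng.normal(loc=row_locs, scale=1.0)
```

With a DataFrame, the same idea can be written as df["B"] = rng.normal(loc=loc_params[df["A"].to_numpy()]), which adds the column to the original frame without any groupby. If the ids are not contiguous integers, they would first need to be mapped to positions, for example with np.unique(A, return_inverse=True).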
