Reputation: 7994
I have a dataframe of ~20M lines
I have a column called A
that gives me an id (there are ~10K ids in total).
The value of this id defines a random distribution's parameters.
Now I want to generate a column B
, that is randomly drawn from the distribution that is defined by the value in the column A
What is the fastest way to do this? Doing something with iterrows
or apply
is extremely slow. Another possiblity is to group by A
, and generate all my data for each value of A (so I only draw from one distribution). But then I don't end up with a Dataframe but with a "groupBy" object, and I don't know how to go back to having the initial dataframe, plus my new column.
Upvotes: 0
Views: 467
Reputation: 553
I think this approach is similar to what you were describing, where you generate the samples for each id. On my machine, it appears this would take around 5 minutes to run. I assume you can trivially get the ids.
import numpy as np
num_ids = 10000
num_rows = 20000000
ids = np.arange(num_ids)
loc_params = np.random.random(num_ids)
A = np.random.randint(0, num_ids, num_rows)
B = np.zeros(A.shape)
for idx in ids:
A_idxs = A == idx
B[A_idxs] = np.random.normal(np.sum(A_idxs), loc_params[idx])
Upvotes: 2
Reputation: 404
This question is pretty vague, but how would this work for you?
df['B'] = df.apply(lambda row: distribution(row.A), axis=1)
Editing from question edits (apply is too slow):
You could create a mapping dictionary for the 10k ids to their generated value, then do something like
df['B'] = df['A'].map(dictionary)
I'm unsure if this will be faster than apply, but it will require fewer calls to your random distribution generator
Upvotes: 2