Reputation: 31
I have a dataframe with over a million rows. For every row I have 4 columns, containing the weights. How can I efficiently sample for every row with their corresponding weights? I would just like to choose a number 1,2,3 or 4 for every row using the weights of each row. Right now I have this for-loop but that will take way too long.
import random
import pandas as pd

df = pd.DataFrame({
    '1': [0.155, 0.138, ...],
    '2': [0.473, 0.307, ...],
    '3': [0.291, 0.490, ...],
    '4': [0.080, 0.064, ...],
    'pick': ''  # a scalar broadcasts to every row; a one-element list would raise a length error
})
for i in range(len(df)):
    # random.choices returns a list, so take its first element;
    # .loc avoids chained-assignment warnings
    df.loc[i, 'pick'] = random.choices(
        [1, 2, 3, 4],
        weights=[df['1'][i], df['2'][i], df['3'][i], df['4'][i]],
        k=1,
    )[0]
Upvotes: 3
Views: 479
Reputation: 788
To add to the previous answer:
You could use the apply function instead of iterating over all the rows (which is generally slower).
First define a (lambda) function, then apply the function to each row of the four weight columns (restricting to those columns matters: applying to the whole frame would also pass the pick column, breaking the p= argument):
pick_function = lambda row_vals: np.random.choice([1, 2, 3, 4], p=row_vals)
df['pick'] = df[['1', '2', '3', '4']].apply(pick_function, axis=1)  # axis=1 -> passes each row's values as the argument
As an aside, the lambda function does the same as this:
def pick_function(row_vals):
    rand_value = np.random.choice([1, 2, 3, 4], p=row_vals)
    return rand_value
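For over a million rows, even apply still makes one Python-level call per row. A fully vectorized sketch (assuming the weight columns are named '1' through '4' and each row already sums to 1) compares one uniform draw per row against the row-wise cumulative weights:

```python
import numpy as np
import pandas as pd

# small illustrative frame; the same code works unchanged on a million rows
df = pd.DataFrame({
    '1': [0.155, 0.138],
    '2': [0.473, 0.307],
    '3': [0.291, 0.490],
    '4': [0.081, 0.065],
})

weights = df[['1', '2', '3', '4']].to_numpy()
cum = weights.cumsum(axis=1)       # row-wise cumulative probabilities, shape (n, 4)
u = np.random.rand(len(df), 1)     # one uniform draw per row, shape (n, 1)
# argmax finds the first column whose cumulative weight exceeds the draw;
# +1 maps column indices 0..3 to the choices 1..4
df['pick'] = (u < cum).argmax(axis=1) + 1
```

This does the whole sampling in a handful of NumPy operations instead of a per-row loop.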
Upvotes: 1
Reputation: 10624
Try with numpy, which is generally faster:
for i in range(len(df)):
    # omitting the size argument makes np.random.choice return a scalar,
    # and .loc avoids chained assignment
    df.loc[i, 'pick'] = np.random.choice([1, 2, 3, 4], p=df.iloc[i, :4])
However, since your weights do not always sum exactly to 1, first recompute one column (e.g. the 4th) so that every row does:
df['4'] = 1 - (df['1'] + df['2'] + df['3'])
Output:
1 2 3 4 pick
0 0.155 0.473 0.291 0.081 2
1 0.138 0.307 0.490 0.065 4
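As a more general alternative to recomputing one column, the weights can be renormalized row-wise, which keeps the relative proportions of all four columns (a sketch, assuming the columns are named '1' through '4'):

```python
import pandas as pd

df = pd.DataFrame({
    '1': [0.155, 0.138],
    '2': [0.473, 0.307],
    '3': [0.291, 0.490],
    '4': [0.080, 0.064],
})

cols = ['1', '2', '3', '4']
# divide each row by its own sum so every row becomes a probability distribution
df[cols] = df[cols].div(df[cols].sum(axis=1), axis=0)
```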
Upvotes: 3