Ruben de Bever

Reputation: 31

How can I get a column with random samples, based on weights contained in a pandas dataframe?

I have a dataframe with over a million rows. Each row has 4 columns containing weights. How can I efficiently draw a sample for every row using that row's weights? I would just like to choose a number 1, 2, 3 or 4 for every row. Right now I have this for-loop, but it will take way too long.

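import random
import pandas as pd

df = pd.DataFrame({
    '1': [0.155, 0.138, ...],
    '2': [0.473, 0.307, ...],
    '3': [0.291, 0.490, ...],
    '4': [0.080, 0.064, ...],
})
df['pick'] = ''  # placeholder column, filled in row by row below

for i in range(len(df)):
    # random.choices returns a list, so take its single element
    df.loc[i, 'pick'] = random.choices([1, 2, 3, 4], weights=[df['1'][i], df['2'][i], df['3'][i], df['4'][i]], k=1)[0]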

Upvotes: 3

Views: 479

Answers (2)

Andre

Reputation: 788

To add to the previous answer:

You could use the apply function instead of iterating over all the rows with an explicit loop, which is generally slower.

First define a (lambda) function, then apply the function to each row:

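weight_cols = ['1', '2', '3', '4']

# Normalize each row so the probabilities sum to 1, as np.random.choice requires
pick_function = lambda row_vals: np.random.choice([1, 2, 3, 4], p=row_vals / row_vals.sum())

# axis=1 passes each row's values as the argument; only the weight columns are selected
df['pick'] = df[weight_cols].apply(pick_function, axis=1)

As an aside, the lambda function does the same as this:

def pick_function(row_vals):
    rand_value = np.random.choice([1, 2, 3, 4], p=row_vals / row_vals.sum())
    return rand_value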

Upvotes: 1

IoaTzimas

Reputation: 10624

Try NumPy, which is generally faster:

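for i in range(len(df)):
    # .loc avoids chained assignment; without a size argument, choice returns a scalar
    df.loc[i, 'pick'] = np.random.choice([1, 2, 3, 4], p=df.iloc[i, :4].to_numpy(dtype=float))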

However, np.random.choice requires the probabilities to sum to 1, and your weights do not always do so. Adjust one column (e.g. the 4th) like this before running the loop:

df['4']=1-(df['1']+df['2']+df['3'])

Output:

       1      2      3      4  pick
0  0.155  0.473  0.291  0.081     2
1  0.138  0.307  0.490  0.065     4
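
For a million rows, it may be worth avoiding the per-row Python call altogether. The following is only a sketch of one vectorized option (inverse-CDF sampling), not something taken from the question: it normalizes each row instead of adjusting the 4th column, draws one uniform number per row, and counts how many cumulative-probability bounds that number has passed. The names weight_cols, rng, cdf and u are made up for illustration, the seed is arbitrary, and the two-row frame just reuses the values shown in the question.

```python
import numpy as np
import pandas as pd

# Small stand-in for the real frame (the two rows shown in the question).
df = pd.DataFrame({
    '1': [0.155, 0.138],
    '2': [0.473, 0.307],
    '3': [0.291, 0.490],
    '4': [0.080, 0.064],
})

weight_cols = ['1', '2', '3', '4']   # hypothetical helper name
rng = np.random.default_rng(0)       # arbitrary seed, only for reproducibility

w = df[weight_cols].to_numpy(dtype=float)
# Normalize each row so its weights sum to 1.
p = w / w.sum(axis=1, keepdims=True)
# Row-wise cumulative probabilities.
cdf = np.cumsum(p, axis=1)
# One uniform draw in [0, 1) per row.
u = rng.random(len(df))
# Count how many of the first three cumulative bounds each draw has reached;
# this gives a column index 0..3, chosen with probability p[row, index].
idx = (u[:, None] >= cdf[:, :-1]).sum(axis=1)
df['pick'] = idx + 1                 # map index 0..3 to labels 1..4
```

Dropping the last cumulative bound from the comparison keeps the index in range even if rounding leaves the final cumulative sum slightly below 1.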

Upvotes: 3
