Jon

Reputation: 61

Random sampling and Pandas dataframes

I have the following dataframe, cr_df, which shows the rate at which each ID1 converts to each ID2:

   ID1 ID2 Conversion Rate
0  1     A      0.046562
1  1     B      0.315975
2  1     C      0.577998
3  1     D      0.059465
4  2     A      0.6
5  2     B      0.4

Then I have another dataframe, raw_df, in the format of ID1 such as:

   ID1 Value
0  1     100  
1  2     200

My goal is to output a dataframe final_df, in the ID2 format that looks something like:

   ID2 Value
0  C     100  
1  A     200

The mapping from ID1 to ID2 should work by drawing a random value between 0 and 1 and picking the ID2 according to the conversion rates.

How can I achieve this in pandas? (Do I need to use .apply?)

Upvotes: 1

Views: 1002

Answers (2)

unutbu

Reputation: 880777

Given this setup:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID1': [1]*4+[2]*2, 'ID2':list('ABCDAB'), 
    'Conversion Rate': [0.046562, 0.315975, 0.577998, 0.059465, 0.6, 0.4]})
raw_df = pd.DataFrame({'ID1': [1,2], 'Value':[100, 200]})

you could define a function random_id2:

def random_id2(x):
    return np.random.choice(x['ID2'], p=x['Conversion Rate'].values)

and use groupby/apply:

id2 = df.groupby(['ID1']).apply(random_id2)

to obtain a Series such as the following (the draw is random, so your values may differ):

ID1
1    C
2    A
dtype: object
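
For reference, this is roughly what a weighted choice does for one ID1 group, written in the terms the question uses (a random value between 0 and 1 compared against the conversion rates). It is only a sketch: pick_id2, ids, rates and rng are names made up for illustration, and the final min(...) is a guard I added against floating-point round-off in the cumulative sum.

```python
import numpy as np

# Conversion rates for ID1 == 1, taken from the question
ids = np.array(['A', 'B', 'C', 'D'])
rates = np.array([0.046562, 0.315975, 0.577998, 0.059465])

def pick_id2(u):
    """Return the ID2 whose cumulative-rate interval contains u (0 <= u < 1)."""
    cdf = np.cumsum(rates)                        # upper edge of each interval
    idx = np.searchsorted(cdf, u, side='right')   # first edge strictly above u
    return ids[min(idx, len(ids) - 1)]            # guard against round-off in cdf

rng = np.random.default_rng()
picked = pick_id2(rng.random())  # rng.random() draws uniformly from [0, 1)
print(picked)
```

In practice np.random.choice with p= (as above) does the same job with less code.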

You could then build final_df by mapping raw_df['ID1'] values to id2 values:

final_df = raw_df.copy()
final_df['ID1'] = final_df['ID1'].map(id2)
final_df = final_df.rename(columns={'ID1': 'ID2'})

Putting it all together:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID1': [1]*4+[2]*2, 'ID2':list('ABCDAB'), 
    'Conversion Rate': [0.046562, 0.315975, 0.577998, 0.059465, 0.6, 0.4]})
raw_df = pd.DataFrame({'ID1': [1,2], 'Value':[100, 200]})

def random_id2(x):
    return np.random.choice(x['ID2'], p=x['Conversion Rate'].values)

id2 = df.groupby(['ID1']).apply(random_id2)

final_df = raw_df.copy()
final_df['ID1'] = final_df['ID1'].map(id2)
final_df = final_df.rename(columns={'ID1': 'ID2'})

print(final_df)

yields output like the following (the result varies from run to run):

  ID2  Value
0   C    100
1   A    200
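
One thing to be aware of: the groupby/apply step draws once per ID1, so if raw_df contained several rows with the same ID1 they would all receive the same ID2. If you want an independent draw for every row instead, something along these lines might work. This is a sketch, not tested against your real data; the raw_df with repeated ID1 values is made up, and sample_id2 and rng are hypothetical names.

```python
import numpy as np
import pandas as pd

cr_df = pd.DataFrame({
    'ID1': [1] * 4 + [2] * 2,
    'ID2': list('ABCDAB'),
    'Conversion Rate': [0.046562, 0.315975, 0.577998, 0.059465, 0.6, 0.4]})

# Hypothetical input with repeated ID1 values (not from the question)
raw_df = pd.DataFrame({'ID1': [1, 1, 2, 2], 'Value': [100, 150, 200, 250]})

rng = np.random.default_rng()

def sample_id2(id1):
    """Draw one ID2 for the given ID1, weighted by its conversion rates."""
    rates = cr_df[cr_df['ID1'] == id1]
    return rng.choice(rates['ID2'].to_numpy(),
                      p=rates['Conversion Rate'].to_numpy())

# Series.map calls sample_id2 once per row, so every row gets its own draw
final_df = pd.DataFrame({
    'ID2': raw_df['ID1'].map(sample_id2),
    'Value': raw_df['Value'],
})
print(final_df)
```

This filters cr_df once per row, which is fine for small inputs; for a large raw_df it would probably be worth grouping raw_df by ID1 and drawing all the samples for a group in one rng.choice call with size=.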

Upvotes: 1

Ami Tavory

Reputation: 76406

You can do a combination of the following:

  • To make a weighted random choice of the rows, use the answer in this question; specifically, make a weighted selection of range(len(df)) with the weights given by df['Conversion Rate'].

  • To select the rows with the given indices, see here.

  • To join the resulting dataframe with the second one, use merge.
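
A minimal sketch of how the three steps above could fit together. I am assuming the weighted selection is done separately within each ID1 group (over the whole frame the rates do not sum to 1), and the names rng, picked and chosen are just for illustration.

```python
import numpy as np
import pandas as pd

cr_df = pd.DataFrame({
    'ID1': [1] * 4 + [2] * 2,
    'ID2': list('ABCDAB'),
    'Conversion Rate': [0.046562, 0.315975, 0.577998, 0.059465, 0.6, 0.4]})
raw_df = pd.DataFrame({'ID1': [1, 2], 'Value': [100, 200]})

rng = np.random.default_rng()

picked = []
for _, group in cr_df.groupby('ID1'):
    weights = group['Conversion Rate'].to_numpy()
    # Step 1: weighted selection of a row position within the group
    pos = rng.choice(len(group), p=weights / weights.sum())
    # Step 2: select the row at that position (kept as a one-row DataFrame)
    picked.append(group.iloc[[pos]])
chosen = pd.concat(picked)

# Step 3: join the chosen rows with the second dataframe
final_df = raw_df.merge(chosen[['ID1', 'ID2']], on='ID1', how='left')
final_df = final_df[['ID2', 'Value']]
print(final_df)
```

Dividing by weights.sum() is only there so the probabilities sum to exactly 1 even if the rates are slightly off.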

Upvotes: 1
