Confusion Matrix

Reputation: 118

Generate random values and map them to a column based on condition in pandas

I am trying to generate a synthetic data set. I have managed to generate a few columns, but I need to generate a column of random numbers whose range depends on the value in another column.

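import random
import numpy as np
import pandas as pd

# `check` is not shown in the question; it is assumed to be a sequence of dates defined earlier.
def create_trans_dataset(num=1):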
    output=[
            {"trans_date": np.random.choice(check),
             "trans_details":np.random.choice(["airtime_purchase",
                                               "customer_transfer",
                                               "deposit_funds",
                                               "withdrawal"],
                                              p=[0.2, 0.2, 0.2, 0.4]),
             "trans_status": np.random.choice(["completed", "reversed",
                                               "processing"],
                                               p=[0.9, 0.05, 0.05])
           }
            for x in range(num)
          ]
    return output

trans_dataset = pd.DataFrame(create_trans_dataset(num=20))

def map_values(row, values_dict):
    return values_dict[row]

values_dict = {"airtime_purchase": random.randint(5, 5000),
               "customer_transfer": random.randint(100, 35000),
               "deposit_funds": random.randint(100, 35000),
               "withdrawal": random.randint(100, 35000)
            }

trans_dataset['amount_transacted'] = trans_dataset['trans_details'].apply(map_values, args=(values_dict,))

My current solution produces one constant number each for "airtime_purchase", "customer_transfer", "deposit_funds", and "withdrawal". My current output is:

    trans_date  trans_details      trans_status  amount_transacted
0   2020-02-27  customer_transfer  completed     30165
1   2020-03-03  airtime_purchase   completed     14945
2   2020-01-02  withdrawal         completed     14595
3   2020-01-01  withdrawal         completed     26700
4   2020-02-18  airtime_purchase   completed     22860
5   2020-02-22  airtime_purchase   completed     17930
6   2020-01-01  airtime_purchase   completed     24370
7   2020-01-20  customer_transfer  completed     8735
8   2020-03-12  deposit_funds      completed     1065
9   2020-03-20  airtime_purchase   completed     27170

My desired output is a different random number for each customer_transfer, airtime_purchase, deposit_funds, and withdrawal row, as shown below.

    trans_date  trans_details      trans_status  amount_transacted
0   2020-02-27  customer_transfer  completed     3015
1   2020-03-03  airtime_purchase   completed     1495
2   2020-01-02  withdrawal         completed     1595
3   2020-01-01  withdrawal         completed     2600
4   2020-02-18  airtime_purchase   completed     2890
5   2020-02-22  airtime_purchase   completed     930
6   2020-01-01  airtime_purchase   completed     370
7   2020-01-20  customer_transfer  completed     9635
8   2020-03-12  deposit_funds      completed     5005
9   2020-03-20  airtime_purchase   completed     2817

Upvotes: 0

Views: 1076

Answers (1)

Marco Cerliani

Reputation: 22031

I think you can simply do:

import numpy as np
import pandas as pd

def create_trans_dataset(num=1):
    output=[
            {"trans_date": np.random.randint(0,100),
             "trans_details":np.random.choice(["airtime_purchase",
                                               "customer_transfer",
                                               "deposit_funds",
                                               "withdrawal"],
                                              p=[0.2, 0.2, 0.2, 0.4]),
             "trans_status": np.random.choice(["completed", "reversed",
                                               "processing"],
                                               p=[0.9, 0.05, 0.05])
           }
            for x in range(num)
          ]
    return output

trans_dataset = pd.DataFrame(create_trans_dataset(num=100))
trans_dataset['original_trans_details'] = trans_dataset['trans_details'].copy()

count = trans_dataset.trans_details.value_counts()
trans_dataset.loc[trans_dataset.trans_details!='airtime_purchase','trans_details'] = np.random.randint(100, 35000, count.sum()-count['airtime_purchase'])
trans_dataset.loc[trans_dataset.trans_details=='airtime_purchase','trans_details'] = np.random.randint(5, 5000, count['airtime_purchase'])

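This generates an independent random number for every row: between 100 and 35000 for customer_transfer, deposit_funds and withdrawal, and between 5 and 5000 for airtime_purchase. The values are drawn per row rather than once per type, so they vary from row to row (repeats are still possible by chance). Note that np.random.randint excludes the upper bound, and that the numbers are written into trans_details itself; the original labels are kept in original_trans_details.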
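If you would rather keep trans_details untouched and put the amounts in a separate amount_transacted column, as in the desired output, the same per-row idea can be sketched roughly as below. This is only a sketch under some assumptions: AMOUNT_RANGES and add_random_amounts are hypothetical names, the bounds are copied from the question's values_dict, and both ends of each range are treated as inclusive.

```python
import numpy as np
import pandas as pd

# Hypothetical (low, high) bounds per transaction type, copied from the
# question's values_dict; both ends are treated as inclusive here.
AMOUNT_RANGES = {
    "airtime_purchase": (5, 5000),
    "customer_transfer": (100, 35000),
    "deposit_funds": (100, 35000),
    "withdrawal": (100, 35000),
}

def add_random_amounts(df, ranges=AMOUNT_RANGES, seed=None):
    """Return a copy of df with one independent random amount per row."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    # Look up each row's bounds from its transaction type.
    low = out["trans_details"].map(lambda t: ranges[t][0]).to_numpy()
    high = out["trans_details"].map(lambda t: ranges[t][1]).to_numpy()
    # One draw per row; endpoint=True makes the upper bound inclusive.
    out["amount_transacted"] = rng.integers(low, high, endpoint=True)
    return out

demo = pd.DataFrame({"trans_details": ["airtime_purchase", "withdrawal",
                                       "airtime_purchase", "deposit_funds"]})
demo = add_random_amounts(demo, seed=0)
print(demo)
```

Passing arrays as low and high lets the generator draw every row in one call, so no per-type masks or counts are needed.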

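As for why the values_dict in the question gives a constant per type: random.randint(...) is called once, when the dict is built, so each key just holds a fixed integer. A minimal sketch of the difference, assuming you want to stay close to the dict-lookup approach (amount_makers and draw_amount are hypothetical names, not part of the question):

```python
import random

# A dict value is evaluated once, when the dict is built, so this stores
# a single fixed integer rather than a fresh draw per lookup.
fixed_values = {"airtime_purchase": random.randint(5, 5000)}
fixed_draws = [fixed_values["airtime_purchase"] for _ in range(5)]

# Storing callables instead defers the draw until each lookup.
# `amount_makers` and `draw_amount` are hypothetical names for this sketch.
amount_makers = {
    "airtime_purchase": lambda: random.randint(5, 5000),
    "customer_transfer": lambda: random.randint(100, 35000),
    "deposit_funds": lambda: random.randint(100, 35000),
    "withdrawal": lambda: random.randint(100, 35000),
}

def draw_amount(trans_type):
    """Draw a new random amount for one transaction type."""
    return amount_makers[trans_type]()

fresh_draws = [draw_amount("airtime_purchase") for _ in range(5)]
print(fixed_draws)   # five copies of the same number
print(fresh_draws)   # five independent draws
```

With the question's DataFrame, trans_dataset['amount_transacted'] = trans_dataset['trans_details'].apply(draw_amount) would then call the function once per row. This is slower than a vectorized draw on large frames, but it stays close to the original code.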

Upvotes: 1
