Reputation: 118
I am trying to generate a synthetic data set. I have managed to generate a few columns but I need to generate a column of random numbers based on a condition of another column.
import random

import numpy as np
import pandas as pd


def create_trans_dataset(num=1):
    output = [
        # `check` is defined elsewhere (not shown)
        {"trans_date": np.random.choice(check),
         "trans_details": np.random.choice(["airtime_purchase",
                                            "customer_transfer",
                                            "deposit_funds",
                                            "withdrawal"],
                                           # one probability per option, summing to 1
                                           p=[0.2, 0.2, 0.2, 0.4]),
         "trans_status": np.random.choice(["completed", "reversed",
                                           "processing"],
                                          p=[0.9, 0.05, 0.05])
         }
        for x in range(num)
    ]
    return output


trans_dataset = pd.DataFrame(create_trans_dataset(num=20))


def map_values(row, values_dict):
    return values_dict[row]


values_dict = {"airtime_purchase": random.randint(5, 5000),
               "customer_transfer": random.randint(100, 35000),
               "deposit_funds": random.randint(100, 35000),
               "withdrawal": random.randint(100, 35000)
               }
trans_dataset['amount_transacted'] = trans_dataset['trans_details'].apply(map_values, args=(values_dict,))
My current solution produces a constant number for each of "airtime_purchase", "customer_transfer", "deposit_funds", and "withdrawal" instead of a new random number per row. My current output is:
   trans_date      trans_details  trans_status  amount_transacted
0  2020-02-27  customer_transfer     completed              30165
1  2020-03-03   airtime_purchase     completed              14945
2  2020-01-02         withdrawal     completed              14595
3  2020-01-01         withdrawal     completed              26700
4  2020-02-18   airtime_purchase     completed              22860
5  2020-02-22   airtime_purchase     completed              17930
6  2020-01-01   airtime_purchase     completed              24370
7  2020-01-20  customer_transfer     completed               8735
8  2020-03-12      deposit_funds     completed               1065
9  2020-03-20   airtime_purchase     completed              27170
My desired output is to have a random number for all customer_transfers, airtime_purchases, deposit_funds, and withdrawals as shown below.
   trans_date      trans_details  trans_status  amount_transacted
0  2020-02-27  customer_transfer     completed               3015
1  2020-03-03   airtime_purchase     completed               1495
2  2020-01-02         withdrawal     completed               1595
3  2020-01-01         withdrawal     completed               2600
4  2020-02-18   airtime_purchase     completed               2890
5  2020-02-22   airtime_purchase     completed                930
6  2020-01-01   airtime_purchase     completed                370
7  2020-01-20  customer_transfer     completed               9635
8  2020-03-12      deposit_funds     completed               5005
9  2020-03-20   airtime_purchase     completed               2817
Upvotes: 0
Views: 1076
Reputation: 22031
I think you can simply do:
import numpy as np
import pandas as pd


def create_trans_dataset(num=1):
    output = [
        {"trans_date": np.random.randint(0, 100),  # integer stand-in for the date
         "trans_details": np.random.choice(["airtime_purchase",
                                            "customer_transfer",
                                            "deposit_funds",
                                            "withdrawal"],
                                           p=[0.2, 0.2, 0.2, 0.4]),
         "trans_status": np.random.choice(["completed", "reversed",
                                           "processing"],
                                          p=[0.9, 0.05, 0.05])
         }
        for x in range(num)
    ]
    return output


trans_dataset = pd.DataFrame(create_trans_dataset(num=100))
# keep the category labels, because trans_details is overwritten below
trans_dataset['original_trans_details'] = trans_dataset['trans_details'].copy()
count = trans_dataset.trans_details.value_counts()
# one independent draw per non-airtime row (np.random.randint excludes the upper bound)
trans_dataset.loc[trans_dataset.trans_details != 'airtime_purchase', 'trans_details'] = np.random.randint(100, 35000, count.sum() - count['airtime_purchase'])
# the entries that are still strings are the airtime rows; draw one value for each
trans_dataset.loc[trans_dataset.trans_details == 'airtime_purchase', 'trans_details'] = np.random.randint(5, 5000, count['airtime_purchase'])
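This draws a separate random number for every customer_transfer, deposit_funds and withdrawal row from the range 100-35000, and a separate random number for every airtime_purchase row from the range 5-5000 (np.random.randint excludes the upper bound). The draws are independent per row, so the value is no longer constant within a category, although repeats can still occur by chance.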
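For context on why the original attempt gave constants: each random.randint(...) call in values_dict runs once, when the dict is built, so every row of a given type looks up the same stored number. A possible alternative, as a minimal sketch only, is to store the bounds per type and draw one value per row, writing the result to a separate amount_transacted column so trans_details stays intact. The names AMOUNT_RANGES and add_amounts and the seed parameter are hypothetical and not from the question. The bounds are copied from the question's values_dict.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table: (low, high) bounds per transaction type,
# copied from the ranges used in the question.
AMOUNT_RANGES = {
    "airtime_purchase": (5, 5000),
    "customer_transfer": (100, 35000),
    "deposit_funds": (100, 35000),
    "withdrawal": (100, 35000),
}


def add_amounts(df, ranges=AMOUNT_RANGES, seed=None):
    """Return a copy of df with one independently drawn amount per row."""
    rng = np.random.default_rng(seed)
    # Look up each row's bounds from its transaction type.
    low = df["trans_details"].map(lambda t: ranges[t][0]).to_numpy()
    high = df["trans_details"].map(lambda t: ranges[t][1]).to_numpy()
    out = df.copy()
    # endpoint=True makes the upper bound inclusive, like random.randint.
    out["amount_transacted"] = rng.integers(low, high, endpoint=True)
    return out


example = pd.DataFrame({"trans_details": ["airtime_purchase", "withdrawal",
                                          "airtime_purchase", "deposit_funds"]})
result = add_amounts(example, seed=0)
print(result)
```

Because the bounds are passed as arrays, the generator draws each row's value from that row's own range in a single call, and adding a new transaction type only requires a new entry in the lookup table.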
Upvotes: 1