Reputation: 85
I get that this will create an array containing a single sample:
samples = np.random.normal(loc=df_avgs['AVERAGE'][region], scale=df_avgs['STDEV'][region], size=1)
But I want to create a sample for each row, based on a condition. For instance, I have one df of means and standard deviations, and another df of conditions.
df_avgs
REGION | AVERAGE | STDEV |
---|---|---|
0 | -1.61 | 7.75 |
1 | 2.87 | 8.38 |
2 | 3.61 | 7.61 |
3 | -10.26 | 9.19 |
df_conditions
ID | REGION_NAME |
---|---|
0 | Region 0 |
1 | Region 3 |
2 | Region 2 |
3 | Region 1 |
4 | Region 1 |
5 | Region 2 |
6 | Region 3 |
How do I create a df with len(df_conditions) rows, or just add a column to df_conditions, with samples based on each row's region?
Upvotes: 0
Views: 61
Reputation: 14369
Once you have merged the dfs, the means and stds can be passed as arrays:
df_conditions = (
    df_conditions.assign(
        REGION=df_conditions['REGION_NAME'].str.extract(r'(\d+)$').astype(int)
    )
    .merge(df_avgs, on='REGION')
    .assign(
        SAMPLES=lambda x: np.random.default_rng().normal(
            loc=x.pop('AVERAGE'),
            scale=x.pop('STDEV'),
            size=len(x.pop('REGION'))
        )
    )
)
Output:
ID REGION_NAME SAMPLES
0 0 Region 0 -13.940460
1 1 Region 3 -14.353592
2 2 Region 2 -4.011282
3 3 Region 1 0.664078
4 4 Region 1 -1.276447
5 5 Region 2 5.210611
6 6 Region 3 -15.929978
Explanation / intermediate
Use Series.str.extract on df_conditions['REGION_NAME'] to extract the digits, and assign them as 'REGION' with df.assign.
Use df.merge to add the columns 'AVERAGE' and 'STDEV' from df_avgs.
# merged
ID REGION_NAME REGION AVERAGE STDEV
0 0 Region 0 0 -1.61 7.75
1 1 Region 3 3 -10.26 9.19
2 2 Region 2 2 3.61 7.61
3 3 Region 1 1 2.87 8.38
4 4 Region 1 1 2.87 8.38
5 5 Region 2 2 3.61 7.61
6 6 Region 3 3 -10.26 9.19
Use np.random.Generator.normal (preferred to np.random.normal) with three df.pop calls, as we don't need the respective columns in the end result.
Data used
import pandas as pd
import numpy as np

data = {'REGION': {0: 0, 1: 1, 2: 2, 3: 3},
        'AVERAGE': {0: -1.61, 1: 2.87, 2: 3.61, 3: -10.26},
        'STDEV': {0: 7.75, 1: 8.38, 2: 7.61, 3: 9.19}
        }
df_avgs = pd.DataFrame(data)

data2 = {'ID': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6},
         'REGION_NAME': {0: 'Region 0', 1: 'Region 3', 2: 'Region 2', 3: 'Region 1',
                         4: 'Region 1', 5: 'Region 2', 6: 'Region 3'}}
df_conditions = pd.DataFrame(data2)
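As a minimal sketch of why the merge is all that is needed: Generator.normal broadcasts array-valued loc and scale element-wise, so each row gets one draw from its own distribution. The seed 42 and the names rng, means and stds are assumptions added here for repeatability and illustration; they are not part of the answer above.

```python
import numpy as np

# Per-row parameters, in the row order of the merged frame (values from df_avgs)
means = np.array([-1.61, -10.26, 3.61, 2.87])
stds = np.array([7.75, 9.19, 7.61, 8.38])

# Seeding is an assumption, used only to make the draws repeatable
rng = np.random.default_rng(42)

# One draw per element: samples[i] comes from Normal(means[i], stds[i])
samples = rng.normal(loc=means, scale=stds)

print(samples.shape)  # (4,)
```

When size is omitted, the result takes the broadcast shape of loc and scale, so passing size is optional once both are arrays of the same length.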
Upvotes: 0
Reputation: 153510
IIUC, you can merge the two dataframes together and then assign the values using a list comprehension over a zip of two dataframe columns:
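df_zip = (
    df_conditions
    # r'(\d+)' also handles multi-digit region numbers
    .assign(REGION=df_conditions['REGION_NAME'].str.extract(r'(\d+)', expand=False).astype(int))
    .merge(df_avgs, on='REGION', how='left')  # left merge keeps df_conditions' row order
)
# Draw one sample per row from that row's mean and standard deviation
df_conditions['SAMPLES'] = [
    np.random.normal(loc=mu, scale=sigma)
    for mu, sigma in zip(df_zip['AVERAGE'], df_zip['STDEV'])
]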
print(df_conditions)
Output:
ID REGION_NAME SAMPLES
0 0 Region 0 -2.475624
1 1 Region 3 -7.157439
2 2 Region 2 -4.563650
3 3 Region 1 -2.199240
4 4 Region 1 5.221416
5 5 Region 2 7.175620
6 6 Region 3 -22.775366
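For comparison, here is a sketch of a variant that skips the merge and looks the parameters up with Series.map instead. This is an alternative, not what either answer does; the names params, region, loc, scale and rng are chosen here for illustration, and the data is rebuilt from the tables in the question.

```python
import numpy as np
import pandas as pd

df_avgs = pd.DataFrame({
    'REGION': [0, 1, 2, 3],
    'AVERAGE': [-1.61, 2.87, 3.61, -10.26],
    'STDEV': [7.75, 8.38, 7.61, 9.19],
})
df_conditions = pd.DataFrame({
    'ID': range(7),
    'REGION_NAME': ['Region 0', 'Region 3', 'Region 2', 'Region 1',
                    'Region 1', 'Region 2', 'Region 3'],
})

# Index the parameters by region number so they can be used as lookup tables
params = df_avgs.set_index('REGION')

# Extract the trailing digits of REGION_NAME as an integer region number
region = df_conditions['REGION_NAME'].str.extract(r'(\d+)$', expand=False).astype(int)

# map keeps the index and order of df_conditions, so rows stay aligned
loc = region.map(params['AVERAGE'])
scale = region.map(params['STDEV'])

# One draw per row, each from its own region's distribution
rng = np.random.default_rng()
df_conditions['SAMPLES'] = rng.normal(loc=loc.to_numpy(), scale=scale.to_numpy())
print(df_conditions)
```

Because map never reorders or drops rows, the samples can be assigned straight back to df_conditions; a region missing from df_avgs would show up as NaN in loc and scale rather than silently shortening the result.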
Upvotes: 0