granger

Reputation: 85

Create a Pandas dataframe of normal estimates based on varying row requirements

I understand that this draws a single sample (returned as a one-element NumPy array):

samples = np.random.normal(loc=df_avgs['AVERAGE'][region], scale=df_avgs['STDEV'][region], size=1)

But I want to draw a sample for each row, based on a condition. For instance, I have one df of means and standard deviations, and another df of conditions.

df_avgs

REGION  AVERAGE  STDEV
     0    -1.61   7.75
     1     2.87   8.38
     2     3.61   7.61
     3   -10.26   9.19

df_conditions

ID  REGION_NAME
 0  Region 0
 1  Region 3
 2  Region 2
 3  Region 1
 4  Region 1
 5  Region 2
 6  Region 3

How do I create a df with len(df_conditions) rows, or just add a column to df_conditions, where each sample is drawn using the mean and stdev of that row's region?

Upvotes: 0

Views: 61

Answers (2)

ouroboros1

Reputation: 14369

Once you have merged the dfs, the means and standard deviations can be passed to normal as arrays, which gives one draw per row:

df_conditions = (
    df_conditions.assign(
        REGION=df_conditions['REGION_NAME'].str.extract(r'(\d+)$').astype(int)
        )
    .merge(df_avgs, on='REGION')
    .assign(
        # pop() hands each helper column to normal() and drops it from the result
        SAMPLES=lambda x: np.random.default_rng().normal(
            loc=x.pop('AVERAGE'),
            scale=x.pop('STDEV'),
            size=len(x.pop('REGION'))
            )
        )
    )

Output:

   ID REGION_NAME    SAMPLES
0   0    Region 0 -13.940460
1   1    Region 3 -14.353592
2   2    Region 2  -4.011282
3   3    Region 1   0.664078
4   4    Region 1  -1.276447
5   5    Region 2   5.210611
6   6    Region 3 -15.929978
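
To make the "passed as arrays" point concrete, here is a minimal sketch of the broadcasting on its own, outside the dataframe. The `means` and `stds` arrays are just illustrative values copied from the example data, and the seed is only there so the sketch is repeatable; neither is part of the answer above.

```python
import numpy as np

# Illustrative per-row parameters (assumed values, copied from the example data).
means = np.array([-1.61, -10.26, 3.61, 2.87])
stds = np.array([7.75, 9.19, 7.61, 8.38])

# Seeded only to make this sketch repeatable.
rng = np.random.default_rng(seed=0)

# loc and scale are broadcast element-wise: draw i uses means[i] and stds[i].
# With size omitted, the result takes the broadcast shape of loc and scale.
samples = rng.normal(loc=means, scale=stds)

print(samples.shape)  # (4,)
```

Since the output shape already follows from the parameter arrays, the explicit `size=` argument is optional when `loc` and `scale` have one entry per row.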

Explanation / intermediate

  • Use Series.str.extract on df_conditions['REGION_NAME'] to extract the digits and assign as 'REGION' with df.assign.
  • Use df.merge to add columns 'AVERAGE' and 'STDEV' from df_avgs.
# merged

   ID REGION_NAME  REGION  AVERAGE  STDEV
0   0    Region 0       0    -1.61   7.75
1   1    Region 3       3   -10.26   9.19
2   2    Region 2       2     3.61   7.61
3   3    Region 1       1     2.87   8.38
4   4    Region 1       1     2.87   8.38
5   5    Region 2       2     3.61   7.61
6   6    Region 3       3   -10.26   9.19
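
The two steps above can be run on their own to inspect the merged frame. This is only a sketch under a couple of assumptions that are not in the answer: `expand=False` is used so that `str.extract` returns a Series, and `how='left'` is chosen so that every row of df_conditions is kept in its original order even if a region had no match in df_avgs (the answer relies on the default inner merge).

```python
import pandas as pd

df_avgs = pd.DataFrame({
    'REGION': [0, 1, 2, 3],
    'AVERAGE': [-1.61, 2.87, 3.61, -10.26],
    'STDEV': [7.75, 8.38, 7.61, 9.19],
})
df_conditions = pd.DataFrame({
    'ID': range(7),
    'REGION_NAME': ['Region 0', 'Region 3', 'Region 2', 'Region 1',
                    'Region 1', 'Region 2', 'Region 3'],
})

# Step 1: pull the trailing digits out of REGION_NAME as an integer key.
region = (
    df_conditions['REGION_NAME']
    .str.extract(r'(\d+)$', expand=False)
    .astype(int)
)

# Step 2: attach AVERAGE and STDEV to each row via the REGION key.
merged = df_conditions.assign(REGION=region).merge(
    df_avgs, on='REGION', how='left'
)

print(merged)
```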

Data used

import pandas as pd
import numpy as np

data = {'REGION': {0: 0, 1: 1, 2: 2, 3: 3}, 
        'AVERAGE': {0: -1.61, 1: 2.87, 2: 3.61, 3: -10.26}, 
        'STDEV': {0: 7.75, 1: 8.38, 2: 7.61, 3: 9.19}
        }

df_avgs = pd.DataFrame(data)

data2 = {'ID': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6}, 
         'REGION_NAME': {0: 'Region 0', 1: 'Region 3', 2: 'Region 2', 3: 'Region 1', 
                         4: 'Region 1', 5: 'Region 2', 6: 'Region 3'}}
df_conditions = pd.DataFrame(data2)

Upvotes: 0

Scott Boston

Reputation: 153510

IIUC, you can merge the two dataframes and then assign the values using a list comprehension over a zip of the two merged columns:

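# '([0-9]+)' captures the whole region number, so multi-digit regions also work
df_zip = df_conditions.assign(REGION=df_conditions['REGION_NAME'].str.extract('([0-9]+)').astype(int)).merge(df_avgs)

# one scalar draw per row, using that row's mean (m) and stdev (s)
df_conditions['SAMPLES'] = [np.random.normal(loc=m, scale=s) for m, s in zip(df_zip['AVERAGE'], df_zip['STDEV'])]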

print(df_conditions)

Output:

   ID REGION_NAME    SAMPLES
0   0    Region 0  -2.475624
1   1    Region 3  -7.157439
2   2    Region 2  -4.563650
3   3    Region 1  -2.199240
4   4    Region 1   5.221416
5   5    Region 2   7.175620
6   6    Region 3 -22.775366
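
Note that assigning a list to df_conditions['SAMPLES'] is positional, so it depends on the merged frame having the same rows in the same order as df_conditions. As an alternative sketch (a different technique, not what the answer above does), the parameters can be looked up per row with `Series.map`, which keeps the original index and avoids the merge altogether. The variable names `params`, `region`, `loc` and `scale` are made up for this illustration.

```python
import numpy as np
import pandas as pd

df_avgs = pd.DataFrame({
    'REGION': [0, 1, 2, 3],
    'AVERAGE': [-1.61, 2.87, 3.61, -10.26],
    'STDEV': [7.75, 8.38, 7.61, 9.19],
})
df_conditions = pd.DataFrame({
    'ID': range(7),
    'REGION_NAME': ['Region 0', 'Region 3', 'Region 2', 'Region 1',
                    'Region 1', 'Region 2', 'Region 3'],
})

# Index the parameters by region number so they can be used as lookups.
params = df_avgs.set_index('REGION')

# Region number for each row of df_conditions (same index as df_conditions).
region = (
    df_conditions['REGION_NAME']
    .str.extract(r'(\d+)$', expand=False)
    .astype(int)
)

# map() returns one value per row, aligned with df_conditions.
loc = region.map(params['AVERAGE'])
scale = region.map(params['STDEV'])

# One vectorised draw for all rows.
rng = np.random.default_rng()
df_conditions['SAMPLES'] = rng.normal(loc=loc.to_numpy(), scale=scale.to_numpy())

print(df_conditions)
```

A region that is missing from df_avgs would show up as NaN in `loc` and `scale`, so it is worth checking those for missing values before sampling.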

Upvotes: 0
