Scaleable Python normal distribution from pandas DataFrame

Question

I have a pandas dataframe (code below) that has the mean and std deviation by day of week and quarter. What i'd like to do is extract each mean and std deviation by day of week, create a random normal sample from those two values then plot it.

np.random.seed(42)
day_of_week=['mon', 'tues', 'wed', 'thur', 'fri', 'sat','sun']
year=[2017]
qtr=[1,2,3,4]
mean=np.random.uniform(5,30,len(day_of_week)*len(qtr))
std=np.random.uniform(1,10,len(day_of_week)*len(qtr))

dat=pd.DataFrame({'year':year*(len(day_of_week)*len(qtr)),
             'qtr':qtr*len(day_of_week),
             'day_of_week':day_of_week*len(qtr),
             'mean':mean,
             'std': std})
dowuq=dat.day_of_week.unique()

Right now i have a solution to the above which works but this method isn't very scaleable. If I wanted to add in more and more columns i.e another year or break it out by week this would not but efficient. I'm fairly new to python so any help is appreciated.

Code that works but not scaleable:

plt.style.use('fivethirtyeight')
for w in dowuq:
    datsand=dat[dat['day_of_week']==''+str(w)+''][0:4]
    mu=datsand.iloc[0]['mean']
    sigma=datsand.iloc[0]['std']
    mu2=datsand.iloc[1]['mean']
    sigma2=datsand.iloc[1]['std']
    mu3=datsand.iloc[2]['mean']
    sigma3=datsand.iloc[2]['std']
    mu4=datsand.iloc[3]['mean']
    sigma4=datsand.iloc[3]['std']             
    s1=np.random.normal(mu, sigma, 2000)
    s2=np.random.normal(mu2, sigma2, 2000)
    s3=np.random.normal(mu3, sigma3, 2000)
    s4=np.random.normal(mu4, sigma4, 2000)
    sns.kdeplot(s1, bw='scott', label='Q1')
    sns.kdeplot(s2, bw='scott', label='Q2')
    sns.kdeplot(s3, bw='scott', label='Q3')
    sns.kdeplot(s4, bw='scott', label='Q4')
    plt.title(''+str(w)+' in 2017')
    plt.ylabel('Density')
    plt.xlabel('Random')
    plt.xticks(rotation=15)
    plt.show()

asongtoruin · Accepted Answer

You should probably be using groupby, which allows you to group a dataframe. For the time being we group on 'day' only, but you could extend this in future if required.

We can also change to using iterrows to loop over all of the listed rows:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

np.random.seed(42)
day_of_week = ['mon', 'tues', 'wed', 'thur', 'fri', 'sat', 'sun']
year = [2017]
qtr = [1, 2, 3, 4]
mean = np.random.uniform(5, 30, len(day_of_week) * len(qtr))
std = np.random.uniform(1, 10, len(day_of_week) * len(qtr))

dat = pd.DataFrame({'year': year * (len(day_of_week) * len(qtr)),
                    'qtr': qtr * len(day_of_week),
                    'day_of_week': day_of_week * len(qtr),
                    'mean': mean,
                    'std': std})

# Group by day of the week
for day, values in dat.groupby('day_of_week'):
    # Loop over rows for each day of the week
    for i, r in values.iterrows():
        cur_dist = np.random.normal(r['mean'], r['std'], 2000)
        sns.kdeplot(cur_dist, bw='scott', label='{}_Q{}'.format(day, r['qtr']))
    plt.title('{} in 2017'.format(day))
    plt.ylabel('Density')
    plt.xlabel('Random')
    plt.xticks(rotation=15)
    plt.show()
    plt.clf()

Scaleable Python normal distribution from pandas DataFrame

Answers (1)

Related Questions