Regression statistics for subsets of Pandas dataframe

Question

I have a dataframe consisting of multiple years of data with multiple environmental parameters as columns. The dataframe looks like this:

import pandas as pd
import numpy as np
from scipy import stats

Parameters= ['Temperature','Rain', 'Pressure', 'Humidity']
nrows = 365
daterange = pd.date_range('1/1/2019', periods=nrows, freq='D')
Vals = pd.DataFrame(np.random.randint(10, 150, size=(nrows, len(Parameters))), columns=Parameters) 
Vals = Vals.set_index(daterange)
    
print(Vals)

I have created a column with month names as Vals['Month'] = Vals.index.month_name().str.slice(stop=3) and I want to calculate the slope from the regression between two variables, Rain and Temperature and extract them in a dataframe. I have tried a solution as below:

pd.DataFrame.from_dict({y:stats.linregress(Vals['Temperature'], Vals['Rain'])[:2] for y, x in 
 Vals.groupby('Month')},'index').\
rename(columns={0:'Slope',1:'Intercept'})

But the output is not what I expected. I want the monthly regression statistics but the result is like this

    Slope   Intercept
Apr -0.016868   81.723291
Aug -0.016868   81.723291
Dec -0.016868   81.723291
Feb -0.016868   81.723291
Jan -0.016868   81.723291
Jul -0.016868   81.723291
Jun -0.016868   81.723291
Mar -0.016868   81.723291
May -0.016868   81.723291
Nov -0.016868   81.723291
Oct -0.016868   81.723291
Sep -0.016868   81.723291

It seems the regression is calculated from the total dataset and stored in each month index. How can I calculate the monthly statistics from the similar process?

dubbbdan · Accepted Answer

Here is a bit of code that I have used in the past. I used sklearn.LinearModel because I think its a bit easier to use, but you can change to scipy.stats if you like.

This code uses apply and does the linear regression in the function linear_model.

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression


def linear_model(group):
    x,y = group.Temperature.values.reshape(-1,1), group.Rain.values.reshape(-1,1)
    model = LinearRegression().fit(x,y)
    m = model.coef_
    i = model.intercept_
    r_sqd = model.score(x,y)
    return (pd.Series({ 'slope':np.squeeze(m), 'intercept':np.squeeze(i), 
                      'r_sqd':np.squeeze(r_sqd)}))



Parameters= ['Temperature','Rain', 'Pressure', 'Humidity']
nrows = 365
daterange = pd.date_range('1/1/2019', periods=nrows, freq='D')
Vals = pd.DataFrame(np.random.randint(10, 150, size=(nrows, len(Parameters))), columns=Parameters) 
Vals = Vals.set_index(daterange)
   
Vals.groupby(Vals.index.month).apply(linear_model)

Result:

Vals.groupby(Vals.index.month).apply(linear_model)
Out[15]: 
                   slope           intercept     r_sqd
1   -0.06334408633973578   80.98723450432585  0.003480
2    -0.1393001910724248   85.40023995141723  0.020435
3    -0.0535505295232336   69.09958112535743  0.003481
4    0.23187299827488306  57.866651248302546  0.048741
5   -0.04813654915436082   74.31295680099751  0.001867
6    0.31976921541526526  48.496345031992746  0.089027
7    -0.1979417421554613   94.84215558468942  0.052023
8    0.22239030327077666   68.62700822940076  0.061849
9   0.054607306452220644    72.0988798639258  0.002877
10  -0.07841007716276265    91.9211204014171  0.006085
11  -0.13517307855088803  100.44769438307809  0.016045
12   -0.1967407738498068   101.7393002049148  0.042255

Your attempt was close. When you use a for loop with groupby object, you group's name and data in return. The typical convention is:

for name, group in Vals.groupby('Month'):
    #do stuff with group

Since you called x for name and y for group, you could change Vals to y, the code will produce the same result as above.

pd.DataFrame.from_dict({y:stats.linregress(x['Temperature'], x['Rain'])[:2] for y, x in 
 Vals.groupby('Month')},'index').\
rename(columns={0:'Slope',1:'Intercept'})

        Slope   Intercept
Apr  0.231873   57.866651
Aug  0.222390   68.627008
Dec -0.196741  101.739300
Feb -0.139300   85.400240
Jan -0.063344   80.987235
Jul -0.197942   94.842156
Jun  0.319769   48.496345
Mar -0.053551   69.099581
May -0.048137   74.312957
Nov -0.135173  100.447694
Oct -0.078410   91.921120
Sep  0.054607   72.098880

Regression statistics for subsets of Pandas dataframe

Answers (1)

Related Questions