Reputation: 19
I have a dataframe consisting of multiple years of data with multiple environmental parameters as columns. The dataframe looks like this:
import pandas as pd
import numpy as np
from scipy import stats
Parameters= ['Temperature','Rain', 'Pressure', 'Humidity']
nrows = 365
daterange = pd.date_range('1/1/2019', periods=nrows, freq='D')
Vals = pd.DataFrame(np.random.randint(10, 150, size=(nrows, len(Parameters))), columns=Parameters)
Vals = Vals.set_index(daterange)
print(Vals)
I have created a column with month names as Vals['Month'] = Vals.index.month_name().str.slice(stop=3)
and I want to calculate the slope from the regression between two variables, Rain and Temperature
and extract them in a dataframe. I have tried a solution as below:
pd.DataFrame.from_dict({y:stats.linregress(Vals['Temperature'], Vals['Rain'])[:2] for y, x in
Vals.groupby('Month')},'index').\
rename(columns={0:'Slope',1:'Intercept'})
But the output is not what I expected. I want the monthly regression statistics but the result is like this
Slope Intercept
Apr -0.016868 81.723291
Aug -0.016868 81.723291
Dec -0.016868 81.723291
Feb -0.016868 81.723291
Jan -0.016868 81.723291
Jul -0.016868 81.723291
Jun -0.016868 81.723291
Mar -0.016868 81.723291
May -0.016868 81.723291
Nov -0.016868 81.723291
Oct -0.016868 81.723291
Sep -0.016868 81.723291
It seems the regression is calculated from the total dataset and stored in each month index. How can I calculate the monthly statistics from the similar process?
Upvotes: 1
Views: 599
Reputation: 2740
Here is a bit of code that I have used in the past. I used sklearn.LinearModel
because I think its a bit easier to use, but you can change to scipy.stats if you like.
This code uses apply
and does the linear regression in the function linear_model
.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
def linear_model(group):
x,y = group.Temperature.values.reshape(-1,1), group.Rain.values.reshape(-1,1)
model = LinearRegression().fit(x,y)
m = model.coef_
i = model.intercept_
r_sqd = model.score(x,y)
return (pd.Series({ 'slope':np.squeeze(m), 'intercept':np.squeeze(i),
'r_sqd':np.squeeze(r_sqd)}))
Parameters= ['Temperature','Rain', 'Pressure', 'Humidity']
nrows = 365
daterange = pd.date_range('1/1/2019', periods=nrows, freq='D')
Vals = pd.DataFrame(np.random.randint(10, 150, size=(nrows, len(Parameters))), columns=Parameters)
Vals = Vals.set_index(daterange)
Vals.groupby(Vals.index.month).apply(linear_model)
Result:
Vals.groupby(Vals.index.month).apply(linear_model)
Out[15]:
slope intercept r_sqd
1 -0.06334408633973578 80.98723450432585 0.003480
2 -0.1393001910724248 85.40023995141723 0.020435
3 -0.0535505295232336 69.09958112535743 0.003481
4 0.23187299827488306 57.866651248302546 0.048741
5 -0.04813654915436082 74.31295680099751 0.001867
6 0.31976921541526526 48.496345031992746 0.089027
7 -0.1979417421554613 94.84215558468942 0.052023
8 0.22239030327077666 68.62700822940076 0.061849
9 0.054607306452220644 72.0988798639258 0.002877
10 -0.07841007716276265 91.9211204014171 0.006085
11 -0.13517307855088803 100.44769438307809 0.016045
12 -0.1967407738498068 101.7393002049148 0.042255
Your attempt was close. When you use a for loop with groupby object, you group's name and data in return. The typical convention is:
for name, group in Vals.groupby('Month'):
#do stuff with group
Since you called x
for name
and y
for group
, you could change Vals
to y
, the code will produce the same result as above.
pd.DataFrame.from_dict({y:stats.linregress(x['Temperature'], x['Rain'])[:2] for y, x in
Vals.groupby('Month')},'index').\
rename(columns={0:'Slope',1:'Intercept'})
Slope Intercept
Apr 0.231873 57.866651
Aug 0.222390 68.627008
Dec -0.196741 101.739300
Feb -0.139300 85.400240
Jan -0.063344 80.987235
Jul -0.197942 94.842156
Jun 0.319769 48.496345
Mar -0.053551 69.099581
May -0.048137 74.312957
Nov -0.135173 100.447694
Oct -0.078410 91.921120
Sep 0.054607 72.098880
Upvotes: 1