Tom
Tom

Reputation: 243

Linear Regression Residuals - Should I "standardise" the results and how to do this

I am a biologist. I want to copy an approach that I read in a paper: "To permit associations with death rates to be investigated independently from weight, residuals for death rates were calculated by subtracting the predicted from observed values".

I have a set of death rates (that range from about 0.1 to 0.5), a set of body weights (that range from about 2 to 80), and I want to calculate residuals for the death rates after accounting for body weight.

I wrote this code:

import scipy
from scipy import stats
import sys


# This reads in the weight and mortality data to two lists. 
Weight = []
Mortality = []
for line in open(sys.argv[1]):
        line = line.strip().split()
        Weight.append(float(line[-2]))
        Mortality.append(float(line[-1]))

# This calculates the regression equation.
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(Mortality,Weight)

# This calculates the predicted value for each observed value
obs_values = Mortality
pred_values = []
for i in obs_values:
    pred_i = float(i) * float(slope) + float(intercept)
    pred_values.append(pred_i)

# This prints the residual for each pair of observations
for obs_v,pred_v in zip(obs_values,pred_values):
    Residual = str(obs_v - pred_v)
    print Residual

My question is, when I run this code, some of my residuals seem quite large:

> Sample1 839.710240214 
> Sample2 325.787250084 
> Sample3 -41.3006000084
> Sample4 -70.6676280159
> Sample5 267.05319407
> Sample6 399.204820103
> Sample7 560.723474144
> Sample8 766.292670196
> Sample9 267.05319407
> Sample10 2.7499420027

I'm wondering, do these results seem "normal"/ should they be "standardised" in some way/ did I do something wrong to obtain residuals for mortality rates after accounting for weight?

I would appreciate simple "plain english" answers with possibly code snippets if it were possible, as I am not a statistics expert!

Many thanks

Upvotes: 4

Views: 4699

Answers (2)

Tom
Tom

Reputation: 243

I know I'm not meant to ask a follow up question here, if anyone could tell me how I can continue a discussion of my original question (with code and without a character count) without clicking "answer question", I will gladly move this text to that section; I apologise.

My last question was how "To permit associations with death rates to be investigated independently from weight". My next question is just out of curiosity, if were to expand on this, say if I wanted to examine death rates, independently from weight and height?

I've written this code, for my data these residuals do add up to 0, but I just wanted to check with the experts that this is the way that I would go about this for future reference:

import numpy as np
import statsmodels.formula.api as smf
import sys

dat = np.loadtxt(sys.argv[1],dtype={"names":("SpeciesName","Mortality","Height","Weight"),"formats":("S40","f4","f4","f4")})
mymodel = smf.ols("Mortality~Height+Weight",data=dat).fit()
Residues = list(mymodel.resid_pearson)
SpeciesList = list(dat["SpeciesName"])
for species,residue in zip(SpeciesList,Residues):
    print species + "\t" + str(residue)

Once again, apologies if I have written this in the wrong section; I didn't feel that it was a new question, and as a comment I couldn't add code; I will gladly make this a new question if that is more appropriate.

Upvotes: 1

Chickenmarkus
Chickenmarkus

Reputation: 1161

Take a look into the documentation of scipy.stats.linregess(): The first argument is x, the abscissa, and the second is y, your observed value. So if obs_values = Mortality should be the observed values you have to permute the two arguments of linear regression and have to calculate the predicted values based on the Weight as x (not Mortality as y):

# This calculates the regression equation.
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x=Weight, y=Mortality)

# This calculates the predicted value for each observed value
obs_values = Mortality
pred_values = []
for i in Weight:
    pred_i = float(i) * float(slope) + float(intercept)
    pred_values.append(pred_i)

Additional you can reduce (and speed up) your code drastically by using numpy (scipy uses it anyway).

import numpy as np
from scipy import stats
import sys

# This reads in the weight and mortality data to two arrays.
arr = np.loadtxt(sys.argv[1])
Weight = arr[:,-2]
Mortality = arr[:,-1]

# This calculates the regression equation.
slope, intercept, r_value, p_value, std_err = stats.linregress(x=Weight,y=Mortality)

# This calculates the predicted value for each observed value
obs_values = Mortality
pred_values = slope * Weight + intercept

# This prints the residual for each pair of observations
Residual = obs_values - pred_values
print(Residuals)

Upvotes: 4

Related Questions