If I have a regression line and an r-squared value, is there a simple numpy (or some other Python library) command to randomly draw, say, y values for an x that are consistent with the regression, the same way you could just draw a random value from a distribution?
Thanks!
edit: I have the equation for my regression line and an r^2 value. That r^2 value should provide some information about the distribution of data points around my line, no? If I just call y = random.gauss()*x + b, haven't I lost the information in my r^2? Or would this be incorporated into the standard deviation, and if so, how? Sorry, I just haven't worked with regression much before.
Upvotes: 1
Views: 1966
Reputation: 391952
If I just call y = random.gauss()*x + b, haven't I lost the information in my r^2?
Clearly.
However.
Reading the documentation, we see that random.gauss takes two arguments: a mean and a standard deviation.
The mean must be zero.
The standard deviation, however, needs to be adjusted to match your r**2.
When r**2 == 0, the standard deviation is large: large enough to produce any value in the original range of the sample data. As r**2 approaches 1, the standard deviation gets smaller.
How to compute the standard deviation value that reproduces your r**2?
Brute Force.
m, b = regression_model(some_data)  # your fitting routine, returning slope and intercept
deviations = [y - (m*x + b) for x, y in some_data]  # note the parentheses around m*x + b
This list of deviations is the essential ingredient in the standard deviation formula.
import math
sd = math.sqrt(sum(d**2 for d in deviations) / (len(some_data) - 1))
Now you can use random.gauss(0, sd) to reproduce the deviations in your original data.
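Putting it together, here is a minimal runnable sketch of the brute-force approach. The sample data and the hand-rolled least-squares fit are illustrative; numpy.polyfit(xs, ys, 1) would do the same fitting job:
import math
import random

# Illustrative sample data: a roughly linear relationship with noise
some_data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.3)]

# Ordinary least-squares fit by hand (numpy.polyfit would also work)
n = len(some_data)
mean_x = sum(x for x, _ in some_data) / n
mean_y = sum(y for _, y in some_data) / n
m = (sum((x - mean_x) * (y - mean_y) for x, y in some_data)
     / sum((x - mean_x) ** 2 for x, _ in some_data))
b = mean_y - m * mean_x

# Standard deviation of the residuals
deviations = [y - (m * x + b) for x, y in some_data]
sd = math.sqrt(sum(d ** 2 for d in deviations) / (n - 1))

# Draw a simulated y for any x, consistent with the scatter around the line
x_new = 2.5
y_sim = m * x_new + b + random.gauss(0, sd)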
See @PaulHiemstra's answer for a proper theoretical approach.
Upvotes: 1
Reputation: 60964
Luckily there is no need for brute force :). To get a relationship between the R^2 and the standard deviation of the residuals, it is easiest to start at the definition of the R^2:

R^2 = SSR / SST (1)

where SSR is the sums of squares of the regression, i.e. sum((y' - mean(y))^2), where y' are the values on the regression line, and SST is the total sums of squares, i.e. sum((y - mean(y))^2), where y are the observations. So effectively the R^2 is the fraction of the total variance that is explained by the regression model (or line). For our purpose we need to re-express SSR as SST - SSE, where SSE is the sums of squares between the regression line and the observations; SSE is the variance which is not explained by the regression model. Rewriting (1):

R^2 = (SST - SSE) / SST = 1 - SSE / SST

Solving for SSE:

SSE = (1 - R^2) SST

If we note that to go from sums of squares to variances we need to divide by N - 1, this becomes:

VAR_E = (1 - R^2) VAR_T

Taking the square root gives the standard deviation of the residuals:

SD_E = sqrt((1 - R^2) VAR_T)

and taking VAR_T out of the square root:

SD_E = sqrt(1 - R^2) SD_T

So you need the R^2 and the total standard deviation of the dataset. To verify this, check any introductory statistics book.
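For example, a minimal sketch applying this result, assuming you only know the line's slope and intercept, your R^2, and the standard deviation of the observed y values (the numbers below are made up):
import math
import random

m, b = 2.0, 0.5     # regression line y = m*x + b (illustrative values)
r_squared = 0.9     # your R^2
sd_t = 3.0          # total standard deviation of the observed y values (SD_T)

# Residual standard deviation from SD_E = sqrt(1 - R^2) * SD_T
sd_e = math.sqrt(1 - r_squared) * sd_t

# Draw a y for a given x, consistent with the regression and the R^2
x = 4.2
y = m * x + b + random.gauss(0, sd_e)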
Upvotes: 2