If I have a regression line and an r-squared value, is there a simple numpy (or some other Python library) command to randomly draw, say, y values for an x that are consistent with the regression, the same way you could just draw a random value from a distribution?
Thanks!
edit: I have the equation for my regression line and an r^2 value. That r^2 value should provide some information about the distribution of data points around my line, no? If I just call y = random.gauss()*x + b, haven't I lost the information in my r^2? Or would this be incorporated into the standard deviation, and if so, how? Sorry, I just haven't worked with regression much before.
Upvotes: 1
Views: 1966
Reputation: 391952
If I just call y = random.gauss()*x + b, haven't I lost the information in my r^2?
Clearly.
However.
Reading the documentation, we see that random.gauss takes two arguments: a mean and a standard deviation.
The mean must be zero.
The standard deviation, however, needs to be adjusted to match your r**2.
When r**2 == 0, the standard deviation is large: large enough to produce any value in the original range of the sample data. As r**2 approaches 1, the standard deviation gets smaller.
How to compute the standard deviation value that reproduces your r**2?
Brute Force.
m, b = regression_model(some_data)  # your fitting routine, returning slope and intercept
deviations = [y - (m*x + b) for x, y in some_data]  # note the parentheses around m*x + b
This list of deviations is the essential ingredient in the standard deviation formula.
import math
sd = math.sqrt(sum(d**2 for d in deviations) / (len(some_data) - 1))
Now you can use random.gauss(0, sd) to reproduce the deviations in your original data.
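Putting it together, here is a minimal runnable sketch of the brute-force approach. The sample data and the hand-rolled least-squares fit are illustrative; numpy.polyfit(xs, ys, 1) would do the same fitting job:
import math
import random

# Illustrative sample data: a roughly linear relationship with noise
some_data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.3)]

# Ordinary least-squares fit by hand (numpy.polyfit would also work)
n = len(some_data)
mean_x = sum(x for x, _ in some_data) / n
mean_y = sum(y for _, y in some_data) / n
m = (sum((x - mean_x) * (y - mean_y) for x, y in some_data)
     / sum((x - mean_x) ** 2 for x, _ in some_data))
b = mean_y - m * mean_x

# Standard deviation of the residuals
deviations = [y - (m * x + b) for x, y in some_data]
sd = math.sqrt(sum(d ** 2 for d in deviations) / (n - 1))

# Draw a simulated y for any x, consistent with the scatter around the line
x_new = 2.5
y_sim = m * x_new + b + random.gauss(0, sd)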
See @PaulHiemstra's answer for a proper theoretical approach.
Upvotes: 1
Reputation: 60964
Luckily there is no need for brute force :). To get a relationship between the R^2 and the standard deviation of the residuals, it is easiest to start at the definition of the R^2:

R^2 = SSR / SST (1)

where SSR is the sums of squares of the regression, i.e. sum((y' - mean(y))^2), where y' are the values on the regression line, and SST is the total sums of squares, i.e. sum((y - mean(y))^2), where y are the observations. So effectively the R^2 is the fraction of the total variance that is explained by the regression model (or line). For our purpose we need to re-express SSR as SST - SSE, where SSE is the sums of squares between the regression line and the observations; SSE is the variance which is not explained by the regression model. Rewriting (1):

R^2 = (SST - SSE) / SST = 1 - SSE / SST

Solving for SSE:

SSE = (1 - R^2) SST

If we note that to go from sums of squares to variances we need to divide by N - 1, this becomes:

VAR_E = (1 - R^2) VAR_T

Taking the square root gives the standard deviation of the residuals:

SD_E = sqrt((1 - R^2) VAR_T)

and taking VAR_T out of the square root:

SD_E = sqrt(1 - R^2) SD_T

So you need the R^2 and the total standard deviation of the dataset. To verify this, check any introductory statistics book.
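For example, a minimal sketch applying this result, assuming you only know the line's slope and intercept, your R^2, and the standard deviation of the observed y values (the numbers below are made up):
import math
import random

m, b = 2.0, 0.5     # regression line y = m*x + b (illustrative values)
r_squared = 0.9     # your R^2
sd_t = 3.0          # total standard deviation of the observed y values (SD_T)

# Residual standard deviation from SD_E = sqrt(1 - R^2) * SD_T
sd_e = math.sqrt(1 - r_squared) * sd_t

# Draw a y for a given x, consistent with the regression and the R^2
x = 4.2
y = m * x + b + random.gauss(0, sd_e)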
Upvotes: 2