Reputation: 307
I would like to calculate the standard error of linear regression coefficient using bootstrap technique (100 resamples) but the result I got is zero, which is not normal. I think something is wrong with the bootstrap part of the code. Do you know how to fix my code?
x, y = np.genfromtxt("input.txt", unpack=True)
#regression part
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
print std_err
#bootstrap part of the code
A = np.random.choice(x, size=100, replace=True)
B = np.random.choice(y, size=100, replace=True)
slope2, intercept2, r_value2, p_value2, std_err2 = stats.linregress(A,B)
print std_err2
input.txt:
-1.08 -1.07
-2.62 -2.56
-2.84 -2.79
-2.22 -2.16
-3.47 -3.55
-2.81 -2.79
-2.86 -2.71
-3.41 -3.42
-4.18 -4.21
-3.50 -3.48
-5.67 -5.55
-6.83 -6.95
-6.13 -6.13
-8.34 -8.19
-7.82 -7.83
-9.86 -9.58
-8.67 -8.62
-9.81 -9.81
-8.39 -8.30
Upvotes: 1
Views: 3881
Reputation: 246
I had no issues with your above code running in Python 3.6.1. Maybe check that your scipy version is current?
from scipy import stats
import numpy as np
x, y = np.genfromtxt("./input.txt", unpack=True)
slope_1, intercept_1, r_val_1, p_val_1, stderr_1 = stats.linregress(x, y)
print(slope_1) # 0.9913080927081567
print(stderr_1) # 0.007414734102169809
A = np.random.choice(x, size=100, replace=True)
B = np.random.choice(y, size=100, replace=True)
slope_2, incercept_2, r_val_2, p_val_2, stderr_2 = stats.linregress(A, B)
print(slope_2) # 0.11429903085322253
print(stderr_2) # 0.10158283281966374
Correctly Bootstrapping the Data
The correct way to do this would be to use the resample
method from sklearn.utils
. This method handles the data in a consistent array format. Since your data is an x, y pair, the y value is dependent on your x value. If you randomly sample x and y independently you lose that dependency and your resampled data will not accurately represent your population.
from scipy import stats
from sklearn.utils import resample
import numpy as np
x, y = np.genfromtxt("./input.txt", unpack=True)
slope_1, intercept_1, r_val_1, p_val_1, stderr_1 = stats.linregress(x, y)
print(slope_1) # 0.9913080927081567
print(stderr_1) # 0.007414734102169809
A, B = resample(x, y, n_samples=100) # defaults to w/ replacement
slope_2, incercept_2, r_val_2, p_val_2, stderr_2 = stats.linregress(A, B)
print(slope_2) # 0.9864339054638176
print(stderr_2) # 0.002669659193615103
Upvotes: 1