Reputation: 3950
I have two arrays, say varx and vary. Both contain NaN values at various positions. However, I would like to do a linear regression on both to show how much the two arrays correlate.
This was very helpful so far.
However, using the following
slope, intercept, r_value, p_value, std_err = stats.linregress(varx, vary)
results in NaNs for every output variable. What is the most convenient way to take only valid values from both arrays as input to the linear regression? I heard about masking arrays, but am not sure how it works exactly.
Upvotes: 34
Views: 44503
Reputation: 23449
This doesn't matter for linregress, because it only accepts 1-D arrays anyway, but if x is 2-D and you're building a linear regression model with sklearn.linear_model.LinearRegression, statsmodels.api.OLS, etc., then NaNs have to be dropped row-wise.
m = ~(np.isnan(x).any(axis=1) | np.isnan(y))
x_m, y_m = x[m], y[m]
In the above example, any() reduces the 2-D mask into a 1-D mask, which can be used to remove rows.
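To make the mask reduction concrete, here is a minimal sketch on a tiny hand-made array (the values are made up for illustration and are not from the question). It shows each intermediate mask and which rows survive:

```python
import numpy as np

# Toy data, assumed for illustration only
x = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, 6.0],
              [7.0, np.nan],
              [9.0, 10.0]])
y = np.array([1.0, 2.0, np.nan, 4.0, 5.0])

nan_2d = np.isnan(x)              # boolean mask, same shape as x: (5, 2)
row_has_nan = nan_2d.any(axis=1)  # one flag per row: shape (5,)

# keep a row only if neither x nor y has a NaN in it
m = ~(row_has_nan | np.isnan(y))
x_m, y_m = x[m], y[m]

print(m)          # which rows are kept
print(x_m.shape)  # remaining rows and columns
```

Rows 1 and 3 are dropped because of NaNs in x, and row 2 because of the NaN in y, so only the first and last rows remain.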
A working example may look as follows.
import numpy as np
from sklearn.linear_model import LinearRegression
# sample data
x = np.random.default_rng(0).normal(size=(100,5)) # x is shape (100,5)
y = np.random.default_rng(0).normal(size=100)
# add some NaNs
x[[10,20], [1,3]] = np.nan
y[5] = np.nan
try:
    lr = LinearRegression().fit(x, y) # <---- ValueError, because the input contains NaN
except ValueError as e:
    print(e)
m = ~(np.isnan(x).any(axis=1) | np.isnan(y))
x_m, y_m = x[m], y[m] # remove NaNs
lr = LinearRegression().fit(x_m, y_m) # <---- OK
With statsmodels, it's even easier because its models (e.g. OLS, Logit, GLM etc.) have a keyword argument missing= that can be used to drop NaNs under the hood.
import statsmodels.api as sm
model = sm.OLS(y, x, missing='drop').fit()
print(model.summary())
Upvotes: 1
Reputation: 157484
You can remove NaNs using a mask:
mask = ~np.isnan(varx) & ~np.isnan(vary)
slope, intercept, r_value, p_value, std_err = stats.linregress(varx[mask], vary[mask])
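As a self-contained sketch of the same idea, with made-up sample values (the arrays below are assumptions for illustration, not the asker's data): the mask is True only where both arrays hold a valid number, so indexing with it keeps the x/y pairs aligned.

```python
import numpy as np
from scipy import stats

# Made-up sample data with NaNs at different positions
varx = np.array([1.0, 2.0, np.nan, 4.0, 5.0, 6.0])
vary = np.array([2.0, 4.0, 6.0, np.nan, 10.0, 12.0])

# True only where BOTH arrays are valid
mask = ~np.isnan(varx) & ~np.isnan(vary)

# boolean indexing drops the same positions from both arrays
result = stats.linregress(varx[mask], vary[mask])

print(mask.sum())  # number of pairs used in the fit
print(result.slope, result.intercept, result.rvalue)
```

Here positions 2 and 3 are dropped from both arrays, and the fit runs on the four remaining pairs.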
Upvotes: 51