Reputation: 1743
Please forgive my ignorance. All I'm trying to do is add a squared term to my regression without going through the trouble of defining a new column in my dataframe. I'm using statsmodels.formula.api (as stats) because the format is similar to R, which I am more familiar with.
hours_model = stats.ols(formula='act_hours ~ h_hours + C(month) + trend', data = df).fit()
The above works as expected.
hours_model = stats.ols(formula='act_hours ~ h_hours + h_hours**2 + C(month) + trend', data = df).fit()
This omits h_hours**2 and returns the same output as the line above.
I've also tried: h_hours^2, math.pow(h_hours,2), and poly(h_hours,2) All throw errors.
Any help would be appreciated.
Upvotes: 8
Views: 10088
Reputation: 982
This is based on the answer by Mohammad-Reza Malekpour, but removes the intercept column from the Vandermonde matrix.
import numpy as np
def poly(x, degree):
return np.vander(x, degree + 1, increasing=True)[:, 1:]
Example:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
n = 30
np.random.seed(42)
x = np.random.uniform(0, 1, size=n)
y = 1 + 2 * x + 3 * x**2 + 4 * x**3 + np.random.normal(0, 0.001, size=n)
df = pd.DataFrame({"x": x, "y": y})
results = smf.ols(formula="y ~ poly(x, 3)", data=df).fit()
print(results.summary().tables[1])
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept 1.0009 0.001 1599.564 0.000 1.000 1.002
poly(x, 3)[0] 1.9935 0.006 361.673 0.000 1.982 2.005
poly(x, 3)[1] 3.0124 0.013 232.711 0.000 2.986 3.039
poly(x, 3)[2] 3.9919 0.009 463.711 0.000 3.974 4.010
=================================================================================
Compared with:
results = smf.ols(formula="y ~ x + I(x**2) + I(x**3)", data=df).fit()
print(results.summary().tables[1])
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1.0009 0.001 1599.564 0.000 1.000 1.002
x 1.9935 0.006 361.673 0.000 1.982 2.005
I(x ** 2) 3.0124 0.013 232.711 0.000 2.986 3.039
I(x ** 3) 3.9919 0.009 463.711 0.000 3.974 4.010
==============================================================================
Upvotes: 1
Reputation: 965
Currently, although the statsmodels formula API (in fact Patsy library) doesn't support poly(variable, degree)
function as in R, NumPy's vander(variable, degree+1)
can do the job. However, pay attention that np.vander()
produces the Vandermonde matrix which means you get intercept column too! Let's see this function in an example:
>> x = np.array([1, 2, 3, 5])
>> np.vander(x, 4, increasing=True)
array([[ 1, 1, 1, 1],
[ 1, 2, 4, 8],
[ 1, 3, 9, 27],
[ 1, 5, 25, 125]])
So, you need to remove Patsy's internal intercept by adding -1
to your formula:
hours_model = stats.ols(formula='act_hours ~ np.vander(h_hours, 3, increasing=True) - 1', data = df).fit()
Note that you need to pass your_desired_degree + 1
because the first column is x^0=1.
Upvotes: 4
Reputation: 46888
You can try using I()
like in R:
import statsmodels.formula.api as smf
np.random.seed(0)
df = pd.DataFrame({'act_hours':np.random.uniform(1,4,100),'h_hours':np.random.uniform(1,4,100),
'month':np.random.randint(0,3,100),'trend':np.random.uniform(0,2,100)})
model = 'act_hours ~ h_hours + I(h_hours**2)'
hours_model = smf.ols(formula = model, data = df)
hours_model.exog[:5,]
array([[ 1. , 3.03344961, 9.20181654],
[ 1. , 1.81002392, 3.27618659],
[ 1. , 3.20558207, 10.27575638],
[ 1. , 3.88656564, 15.10539244],
[ 1. , 1.74625943, 3.049422 ]])
Upvotes: 27