Reputation: 357
I have a DataFrame that is 700 rows x 2 columns; below is code to reproduce a smaller version of it (with 7 rows).
import pandas as pd

df = pd.DataFrame(columns=['Private', 'Elite'])
df[''] = ['Abilene Christian University', 'Center for Creative Studies', 'Florida Institute of Technology',
          'LaGrange College', 'Muhlenberg College', 'Saint Mary-of-the-Woods College', 'Union College KY']
df = df.set_index('')
df['Private'] = 'Yes'
df['Elite'] = 'No'
for x in df.columns:
    df[x] = df[x].astype('category')
df_train = df.copy(deep=True)
Both columns hold categorical values (dtype 'category', either Yes or No). According to this post: Linear regression with dummy/categorical variables, the code below should work, since I am specifying C(Private) and C(Elite) as categorical variables...
from statsmodels.formula.api import ols
fit = ols('C(Private) ~ C(Elite)', data=df_train).fit()
fit.summary()
The full error is:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [213], in <cell line: 3>()
1 from statsmodels.formula.api import ols
----> 3 fit = ols('C(Private) ~ C(Elite)', data=df_train).fit()
5 fit.summary()
File ~/.local/lib/python3.10/site-packages/statsmodels/base/model.py:206, in Model.from_formula(cls, formula, data, subset, drop_cols, *args, **kwargs)
203 max_endog = cls._formula_max_endog
204 if (max_endog is not None and
205 endog.ndim > 1 and endog.shape[1] > max_endog):
--> 206 raise ValueError('endog has evaluated to an array with multiple '
207 'columns that has shape {0}. This occurs when '
208 'the variable converted to endog is non-numeric'
209 ' (e.g., bool or str).'.format(endog.shape))
210 if drop_cols is not None and len(drop_cols) > 0:
211 cols = [x for x in exog.columns if x not in drop_cols]
ValueError: endog has evaluated to an array with multiple columns that has shape (700, 2). This occurs when the variable converted to endog is non-numeric (e.g., bool or str).
In other posts I saw some solutions using pd.to_numeric(). I tried the code below, mapping No to 0 and Yes to 1 and then calling pd.to_numeric(), but I still get the same error.
from statsmodels.formula.api import ols

df_train = df_train.replace('Yes', int(1))
df_train = df_train.replace('No', int(0))
for x in df_train.columns:
    df_train[x] = pd.to_numeric(df_train[x])
fit = ols('C(Private) ~ C(Elite)', data=df_train).fit()
fit.summary()
Upvotes: 1
Views: 503
Reputation: 3011
For the endogenous (dependent) variable (Private), you just need to define it as numeric manually, without using the C() function. That function expands the variable into a 2-column design matrix (on the right-hand side, an intercept column plus a dummy column). You can see what it looks like by passing the right-hand side of the formula to patsy's dmatrix function:
from patsy import dmatrix
dmatrix('C(Elite)', df_train)
# DesignMatrix with shape (7, 2)
# Intercept C(Elite)[T.Yes]
# 1 0
# 1 0
# 1 1
# 1 1
# 1 1
# 1 0
# 1 0
# Terms:
# 'Intercept' (column 0)
# 'C(Elite)' (column 1)
As the error says, by using C(Private) on the left-hand side you are also creating a 2-column array for the dependent variable, which results in the ValueError.
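To see this concretely, here is a minimal sketch (assuming the 7-row df_train from the question, with both columns still holding the 'Yes'/'No' strings) that asks patsy for both sides of the formula; the left-hand side comes back with two columns, which is exactly what statsmodels rejects:
from patsy import dmatrices
# Build both design matrices from the formula, the same way statsmodels does internally
y, X = dmatrices('C(Private) ~ C(Elite)', df_train)
print(y.shape)
# (7, 2)  -> the dependent variable has been expanded into two dummy columns
print(y.design_info.column_names)
# something like ['C(Private)[No]', 'C(Private)[Yes]']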
The independent variable Elite does not need to be converted to a categorical data type; statsmodels will automatically treat it as categorical because it contains strings. It will encode it using the Treatment coding scheme, which is just dummy coding, and one of the categories will be omitted if the intercept term is used.
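If you ever want a different baseline, patsy lets you name the reference level explicitly. A small sketch, using the same Elite column from this example:
from patsy import dmatrix
# Dummy-code Elite with 'Yes' as the reference level instead of the default 'No'
dmatrix("C(Elite, Treatment(reference='Yes'))", df_train)
# The single dummy column is then named C(Elite, Treatment(reference='Yes'))[T.No]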
So, you can write the full code as follows (I modified the code to create the initial column values using np.random.choice(), rather than using the same value for all rows):
import numpy as np
import pandas as pd
df = pd.DataFrame(columns=['Private','Elite'])
df[''] = ['Abilene Christian University', 'Center for Creative Studies', 'Florida Institute of Technology',
'LaGrange College', 'Muhlenberg College', 'Saint Mary-of-the-Woods College', 'Union College KY']
df = df.set_index('')
df['Private'] = np.random.choice(['Yes','No'], 7)
df['Elite'] = np.random.choice(['Yes','No'], 7)
# map string values to numeric values
df['Private'] = df['Private'].map({'Yes':1, 'No':0})
df_train = df.copy(deep=True)
print(df_train)
# Private Elite
# Abilene Christian University 0 Yes
# Center for Creative Studies 0 No
# Florida Institute of Technology 0 Yes
# LaGrange College 0 No
# Muhlenberg College 1 Yes
# Saint Mary-of-the-Woods College 0 Yes
# Union College KY 1 Yes
The model code:
fit = ols('Private ~ C(Elite)', data=df_train).fit()
print(fit.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Private R-squared: 0.160
Model: OLS Adj. R-squared: -0.008
Method: Least Squares F-statistic: 0.9524
Date: Tue, 16 Aug 2022 Prob (F-statistic): 0.374
Time: 21:53:46 Log-Likelihood: -3.7600
No. Observations: 7 AIC: 11.52
Df Residuals: 5 BIC: 11.41
Df Model: 1
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept -6.895e-17 0.346 -1.99e-16 1.000 -0.890 0.890
C(Elite)[T.Yes] 0.4000 0.410 0.976 0.374 -0.654 1.454
==============================================================================
Omnibus: nan Durbin-Watson: 2.367
Prob(Omnibus): nan Jarque-Bera (JB): 0.817
Skew: 0.483 Prob(JB): 0.665
Kurtosis: 1.633 Cond. No. 3.51
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
/usr/local/lib/python3.7/dist-packages/statsmodels/stats/stattools.py:75: ValueWarning: omni_normtest is not valid with less than 8 observations; 7 samples were given.
"samples were given." % int(n), ValueWarning)
The C(Elite)[T.Yes] notation indicates that the No level of the Elite column was used as the reference category.
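As a quick sanity check on the interpretation (a sketch using the df_train and fit objects from this example), with a single 0/1 dummy regressor the intercept is just the mean of Private in the reference (No) group, and the C(Elite)[T.Yes] coefficient is the difference between the two group means:
# Group means of the 0/1 Private column within each Elite level
group_means = df_train.groupby('Elite')['Private'].mean()
# Intercept should equal the mean of Private where Elite == 'No'
print(group_means['No'], fit.params['Intercept'])
# The C(Elite)[T.Yes] coefficient should equal the difference in group means
print(group_means['Yes'] - group_means['No'], fit.params['C(Elite)[T.Yes]'])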
Upvotes: 1