Reputation: 2046
I am trying to perform logistic regression in Python using the following code:
from patsy import dmatrices
import numpy as np
import pandas as pd
import statsmodels.api as sm
df=pd.read_csv('C:/Users/Documents/titanic.csv')
df=df.drop(['ticket','cabin','name','parch','sibsp','fare'],axis=1) #remove columns from table
df=df.dropna() #dropping null values
formula = 'survival ~ C(pclass) + C(sex) + age'
df_train = df.iloc[ 0: 6, : ]
df_test = df.iloc[ 6: , : ]
#building the dependent (y) and independent (x) design matrices from the formula
y_train,x_train = dmatrices(formula, data=df_train,return_type='dataframe')
y_test,x_test = dmatrices(formula, data=df_test,return_type='dataframe')
#instantiate the model
model = sm.Logit(y_train,x_train)
res=model.fit()
res.summary()
I am getting an error at this line:
--->res=model.fit()
I have no missing values in the data set. However, my dataset is very small, with just 10 entries. I am not sure what is going wrong here or how I can fix it. I am running the program in a Jupyter notebook. The whole error message is given below:
---------------------------------------------------------------------------
PerfectSeparationError Traceback (most recent call last)
<ipython-input-37-c6a47ec170d5> in <module>()
19 y_test,x_test = dmatrices(formula, data=df_test,return_type='dataframe')
20 model = sm.Logit(y_train,x_train)
---> 21 res=model.fit()
22 res.summary()
C:\Program Files\Anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py in fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
1374 bnryfit = super(Logit, self).fit(start_params=start_params,
1375 method=method, maxiter=maxiter, full_output=full_output,
-> 1376 disp=disp, callback=callback, **kwargs)
1377
1378 discretefit = LogitResults(self, bnryfit)
C:\Program Files\Anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py in fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
201 mlefit = super(DiscreteModel, self).fit(start_params=start_params,
202 method=method, maxiter=maxiter, full_output=full_output,
--> 203 disp=disp, callback=callback, **kwargs)
204
205 return mlefit # up to subclasses to wrap results
C:\Program Files\Anaconda3\lib\site-packages\statsmodels\base\model.py in fit(self, start_params, method, maxiter, full_output, disp, fargs, callback, retall, skip_hessian, **kwargs)
423 callback=callback,
424 retall=retall,
--> 425 full_output=full_output)
426
427 #NOTE: this is for fit_regularized and should be generalized
C:\Program Files\Anaconda3\lib\site-packages\statsmodels\base\optimizer.py in _fit(self, objective, gradient, start_params, fargs, kwargs, hessian, method, maxiter, full_output, disp, callback, retall)
182 disp=disp, maxiter=maxiter, callback=callback,
183 retall=retall, full_output=full_output,
--> 184 hess=hessian)
185
186 # this is stupid TODO: just change this to something sane
C:\Program Files\Anaconda3\lib\site-packages\statsmodels\base\optimizer.py in _fit_newton(f, score, start_params, fargs, kwargs, disp, maxiter, callback, retall, full_output, hess, ridge_factor)
246 history.append(newparams)
247 if callback is not None:
--> 248 callback(newparams)
249 iterations += 1
250 fval = f(newparams, *fargs) # this is the negative likelihood
C:\Program Files\Anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py in _check_perfect_pred(self, params, *args)
184 np.allclose(fittedvalues - endog, 0)):
185 msg = "Perfect separation detected, results not available"
--> 186 raise PerfectSeparationError(msg)
187
188 def fit(self, start_params=None, method='newton', maxiter=35,
PerfectSeparationError: Perfect separation detected, results not available
Upvotes: 0
Views: 6499
Reputation: 838
You have perfect separation, meaning that the two outcome classes in your data can be separated exactly by a hyperplane in the space of your predictors. When this happens, the likelihood has no finite maximum (the coefficient estimates diverge to infinity), hence your error.
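To see why the estimate is infinite, here is a minimal sketch using only the standard library. It is not what statsmodels does internally; the toy data, the no-intercept model, and the function names are all assumptions made up for illustration. It evaluates the log-likelihood of a one-coefficient logistic model on perfectly separated data for increasingly large coefficients.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def log_likelihood(beta, xs, ys):
    """Log-likelihood of a no-intercept model P(y=1 | x) = sigmoid(beta * x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = beta * x
        # log P(y=1) is log_sigmoid(z); log P(y=0) is log_sigmoid(-z)
        total += log_sigmoid(z) if y == 1 else log_sigmoid(-z)
    return total

# Hypothetical toy data: every x < 0 has y = 0 and every x > 0 has y = 1,
# so the sign of x separates the two outcomes perfectly.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

betas = [1.0, 10.0, 100.0]
lls = [log_likelihood(b, xs, ys) for b in betas]
for b, ll in zip(betas, lls):
    print(f"beta={b:>6}: log-likelihood={ll:.3e}")
```

The log-likelihood keeps rising toward 0 as beta grows, so no finite beta is ever the best one, and an optimizer that chases the maximum never settles.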
Example of perfect separation:
Gender Outcome
male 1
male 1
male 0
female 0
female 0
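In this case, if I get a female observation, I know with 100% certainty that the outcome will be 0. That is, for females my data perfectly separates the outcomes (strictly speaking this is quasi-complete separation, because the male rows are still mixed). There is no uncertainty for that group, so its coefficient keeps growing without bound and the numerical calculation for finding my coefficients doesn't converge.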
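A quick way to spot this kind of problem before fitting is to look at which outcomes occur for each level of a categorical predictor. The sketch below does that for the Gender/Outcome table above; the helper name `separating_levels` is hypothetical and not part of statsmodels or pandas.

```python
from collections import defaultdict

def separating_levels(rows):
    """Return {level: outcome} for every level whose rows all share one outcome."""
    outcomes_by_level = defaultdict(set)
    for level, outcome in rows:
        outcomes_by_level[level].add(outcome)
    return {
        level: next(iter(seen))
        for level, seen in outcomes_by_level.items()
        if len(seen) == 1
    }

# The Gender/Outcome rows from the example table.
rows = [("male", 1), ("male", 1), ("male", 0), ("female", 0), ("female", 0)]

flagged = separating_levels(rows)
print(flagged)  # {'female': 0}
```

This only checks one categorical column at a time. Separation can also come from a continuous predictor such as age, or from a combination of predictors, so an empty result here does not rule it out.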
According to your error, something similar is happening to you. With just 10 entries, of which your code uses only the first 6 for training (df.iloc[0:6]), you can imagine how much more likely this is to happen than with, say, 1000 entries. So get more data :)
Upvotes: 4