pyharry

Reputation: 25

How to use all variables for Logistic Regression in Python with statsmodels (equivalent to R glm)

I would like to conduct Logistic Regression in Python.

My reference in R is

model_1 <- glm(status_1 ~., data = X_train, family=binomial)
summary(model_1)

I'm trying to convert this to Python, but I'm not sure how to include all variables.

import statsmodels.api as sm
model = sm.formula.glm("status_1 ~ ", family=sm.families.Binomial(), data=train).fit()
print(model.summary())

How can I use all variables? In other words, what should follow "status_1 ~" in the formula?

Upvotes: 1

Views: 1502

Answers (2)

Billy

Reputation: 41

From your question, I understand that you have binomial data and want to fit a Generalised Linear Model with logit as the link function. As you can see in this thread (jseabold's answer), the feature you mentioned (the "." shorthand for all remaining variables) doesn't exist in patsy yet. So I will show you how to fit a Generalised Linear Model to binomial data with the sm.GLM() function, which takes the data as arrays instead of a formula.

# Imports
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Suppose that your training data is in a dataframe called data_train

# Let's split the data into dependent and independent variables

At this point I want to mention that, for the Binomial family, the dependent variable can be either a single column of 0/1 values (like status_1 in the question) or a 2d array with two columns, as the help for the statsmodels GLM function describes:

Binomial family models accept a 2d array with two columns. If supplied, each observation is expected to be [success, failure].

#Let's create the array which holds the dependent variable

y = data_train[["the name of the column of successes","the name of the column of failures"]]

#Let's create the array which holds the independent variables

X = data_train.drop(columns = ["the name of the column of successes","the name of the column of failures"])

#We have to add a constant in the array of the independent variables because by default constants
#aren't included in the model

X = sm.add_constant(X)

#It's time to create our model

logit_model = sm.GLM(
    endog=y,
    exog=X,
    family=sm.families.Binomial(link=sm.families.links.Logit()),
).fit()

#Let's see some information about our model

print(logit_model.summary())

Upvotes: 0

vtasca

Reputation: 1760

statsmodels makes logistic regression pretty straightforward:

import statsmodels.api as sm

Xtrain = df[['gmat', 'gpa', 'work_experience']]
ytrain = df[['admitted']]

log_reg = sm.Logit(ytrain, Xtrain).fit()

Here, gmat, gpa and work_experience are your independent variables.

Upvotes: 2
