cowboy

Reputation: 661

Logistic Regression in R with Categorical Predictors

I have created a model for predicting sales achievement, but its predictions are not close to the real results. All of the predictors are categorical, so I am wondering whether that is the issue. Below is the code I am using.

setwd("c:/Users/xxxxx/Desktop/SalesPredict/")

# Column classes in file order; the test file has no Sold column
train.column.types <- c('character', # Prospect
                    'factor', # Sls_Office
                    'factor', # Month
                    'factor', # Sls_Rep
                    'factor', # Fin_Type
                    'factor', # Competitor
                    'integer', # Prospect_Size
                    'factor' , # Prospect_Segment
                    'factor' # Sold
)
test.column.types <- train.column.types[-9]

# Pass the column classes to read.csv so they are actually applied
trainData <- read.csv("train3.csv", header = TRUE, colClasses = train.column.types)
testData <- read.csv("test3.csv", header = TRUE, colClasses = test.column.types)

train.glm <- glm(Sold ~ Competitor + Prospect_Segment + Sls_Office + Month + Fin_Type, family = binomial(link = logit), data = trainData)

summary(train.glm)

p.hats <- predict.glm(train.glm, newdata = testData, type = "response")

# Classify as sold (1) when the predicted probability exceeds 0.5
Sold <- ifelse(p.hats > 0.5, 1, 0)

# Names must match those used in cbind() below
Prospect_Segment <- testData[8]
Month <- testData[3]
Prospect_Size <- testData[7]
predict.sub <- cbind(Prospect_Segment, Sold, Month, Prospect_Size)
colnames(predict.sub) <- c("Segment", "Predicted Disposition", "Month", "Size")
write.csv(predict.sub, file = "SalesPredictions.csv", row.names = FALSE)

Do I need to convert the categorical variables to something else? The training set has approximately 1650 rows of actual results, and the test set (whose outcomes I am trying to predict) has approximately 540 rows. These 540 are also real, so I know approximately what the outcome should be. In the training data, Sold = 1 approximately 11% of the time. In the test data, the model predicts Sold = 1 zero times. Any help or direction on how to improve this would be appreciated.

Upvotes: 2

Views: 7905

Answers (1)

MrFlick

Reputation: 206167

If your question is whether you need to convert factor variables to something else when using glm, the answer is no. If a variable truly is categorical, keeping it as a factor is the correct thing to do. By default, R uses reference-level (treatment) coding for factors when fitting the model.
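
To see what reference-level coding looks like, here is a minimal sketch on made-up data. The Fin_Type levels ("Cash", "Lease", "Loan") are hypothetical and not taken from the question; the point is only that R builds one indicator column per non-reference level, and that relevel() changes which level is the reference.

```r
# Hypothetical toy data: these Fin_Type levels are invented for illustration
toy <- data.frame(
  Fin_Type = factor(c("Cash", "Lease", "Loan", "Lease", "Cash", "Loan"))
)

# With R's default contrasts, an unordered factor is expanded into
# indicator columns for every level except the first (the reference)
design <- model.matrix(~ Fin_Type, data = toy)
print(colnames(design))

# relevel() picks a different reference level; the columns change accordingly
toy$Fin_Type <- relevel(toy$Fin_Type, ref = "Loan")
design2 <- model.matrix(~ Fin_Type, data = toy)
print(colnames(design2))
```

glm builds the same kind of design matrix internally, so each factor coefficient in summary() is a comparison against that factor's reference level.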

If you are not getting the results you expect, the variable class is not the problem. As @josiber pointed out, it may simply be a shortcoming of logistic regression in the case of unbalanced data. However, since you did not provide enough data to make your example reproducible, it is hard to be certain of that.
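
As a rough illustration of the unbalanced-data point, the sketch below uses simulated data, not your data. The segment names, the per-segment sale rates, and the choice of the training base rate as a cutoff are all assumptions made for the example. With a positive rate of roughly 11%, the fitted probabilities may never reach 0.5, so a 0.5 cutoff predicts no sales at all.

```r
set.seed(1)  # simulated data only; nothing here comes from the question's files
n <- 1650
segment <- factor(sample(c("A", "B", "C"), n, replace = TRUE))

# Assumed true sale probabilities per segment (overall rate roughly 11%)
p.true <- c(A = 0.05, B = 0.10, C = 0.18)[as.character(segment)]
sold <- rbinom(n, 1, p.true)
sim <- data.frame(segment, sold)

fit <- glm(sold ~ segment, family = binomial(link = "logit"), data = sim)
p.hat <- predict(fit, type = "response")

# Fitted probabilities track the per-segment sale rates, all well below 0.5,
# so thresholding at 0.5 labels every row as "not sold"
print(range(p.hat))
print(table(predicted = as.integer(p.hat > 0.5), actual = sim$sold))

# One possible alternative (an assumption, not the only reasonable choice):
# use the training base rate as the cutoff instead of 0.5
cutoff <- mean(sim$sold)
print(table(predicted = as.integer(p.hat > cutoff), actual = sim$sold))
```

Whether a lower cutoff is appropriate depends on the relative cost of false positives and false negatives. It may also be more useful to compare the predicted probabilities themselves against the observed rates rather than hard 0/1 labels.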

Upvotes: 1
