Amandeep Rathee

Reputation: 77

Applying logistic regression to the Titanic dataset

I have the famous Titanic data set from Kaggle's website. I want to predict passenger survival using logistic regression, and I am using the glm() function in R. I first divide my data frame (891 rows in total) into two data frames: train (rows 1 to 800) and test (rows 801 to 891). The code is as follows:

```
data <- read.csv("train.csv", stringsAsFactors = FALSE)

names(data)
#  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"         "Age"         "SibSp"
#  [8] "Parch"       "Ticket"      "Fare"        "Cabin"       "Embarked"

# Replace NA values in the Age column with the mean of the non-NA values of Age.
data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)

# Convert sex into binary values: 1 for males and 0 for females.
sexcode <- ifelse(data$Sex == "male", 1, 0)

# Divide the data into train and test data frames.
train <- data[1:800, ]
test <- data[801:891, ]

# Set up the model using glm().
model <- glm(Survived ~ sexcode[1:800] + Age + Pclass + Fare, family = binomial(link = 'logit'), data = train, control = list(maxit = 50))

# Create a data frame to predict on.
newtest <- data.frame(sexcode[801:891], test$Age, test$Pclass, test$Fare)

prediction <- predict(model, newdata = newtest, type = 'response')
```

When I run the last line of code,

prediction <- predict(model,newdata = newtest,type='response')

I get the following error:

Error in eval(expr, envir, enclos) : object 'Age' not found

Can anyone please explain what the problem is? I have checked the newtest variable and there doesn't seem to be any problem with it.

Here is the link to the Titanic data set: https://www.kaggle.com/c/titanic/download/train.csv

Upvotes: 1

Views: 1639

Answers (1)

Kabulan0lak

Reputation: 2136

First, you should add sexcode directly to the data frame, so that it is split into train and test along with the other columns:

data$sexcode <- ifelse(data$Sex == "male",1,0)

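Then, as I commented, the problem is the column names in the newtest data frame. Because you build it manually from expressions such as test$Age, its columns are named after those expressions (for example test.Age), so predict() cannot find a column called Age. You can pass the test data frame directly instead.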
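To see the naming issue in isolation, here is a minimal sketch on a small made-up data frame. The name toy and its values are purely illustrative and are not taken from the Titanic file.

```r
# Toy stand-in for the test data frame (values are made up for illustration).
toy <- data.frame(Age = c(22, 38, 26), Fare = c(7.25, 71.28, 7.92))

# Building a data frame from `toy$...` expressions derives the column
# names from those expressions, not from the original column names.
manual <- data.frame(toy$Age, toy$Fare)
names(manual)  # "toy.Age" "toy.Fare"

# Naming the columns explicitly (or simply passing `toy` itself) keeps
# the names that a formula such as `y ~ Age + Fare` would look up.
named <- data.frame(Age = toy$Age, Fare = toy$Fare)
names(named)   # "Age" "Fare"
```

A model fitted with Age in its formula looks up a column called Age in newdata, which is why the manually built data frame fails while the original one works.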

So here is your full working code:

  data <- read.csv("train.csv", stringsAsFactors = FALSE)
  data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)
  data$sexcode <- ifelse(data$Sex == "male",1,0)

  train <- data[1:800,]
  test <- data[801:891,]

  model <- glm(Survived~sexcode+Age+Pclass+Fare,family=binomial(link='logit'),data=train, control = list(maxit = 50))

  prediction <- predict(model,newdata = test,type='response')
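If you want to check that the same pattern runs end to end without downloading anything, the sketch below uses simulated data in place of train.csv. The simulated columns, the coefficients used to generate Survived, and the 0.5 cut-off are all assumptions made for illustration and say nothing about the real Titanic data.

```r
set.seed(1)

# Simulated stand-in data (hypothetical; not the Titanic file).
n <- 200
sim <- data.frame(
  sexcode = rbinom(n, 1, 0.6),
  Age     = round(runif(n, 1, 70)),
  Pclass  = sample(1:3, n, replace = TRUE),
  Fare    = round(runif(n, 5, 100), 2)
)

# Generate a 0/1 outcome from an arbitrary logistic relationship.
lin <- 2 - 2 * sim$sexcode - 0.02 * sim$Age - 0.5 * sim$Pclass + 0.01 * sim$Fare
sim$Survived <- rbinom(n, 1, plogis(lin))

train <- sim[1:150, ]
test  <- sim[151:200, ]

model <- glm(Survived ~ sexcode + Age + Pclass + Fare,
             family = binomial(link = "logit"), data = train)

# type = "response" returns one predicted probability per row of newdata.
prediction <- predict(model, newdata = test, type = "response")

# Assumed 0.5 cut-off to turn probabilities into 0/1 labels.
predicted_class <- ifelse(prediction > 0.5, 1, 0)
accuracy <- mean(predicted_class == test$Survived)
```

Because test has the same column names as train, predict() finds every variable in the formula and returns 50 probabilities, one per test row.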

Upvotes: 3
