Kedar

Reputation: 23

glm function not taking correct dataset

I have just started learning R and am working on a dataset with 1470 cases. The dataset is named ABC. Using as.factor, I converted the categorical variables to factors.

Dept_1 <- as.factor(ABC$Dept)
Education_1 <- as.factor(ABC$Education)
BusinessTravel_1 <- as.factor(ABC$BusinessTravel)

After that I split the dataset into train and test sets. The number of cases in each looks correct. Then I call glm using the syntax below:

fit = glm(attrition~Dept_1+Education_1+BusinessTravel_1,binomial(link="logit"),train)

The fit runs, but it is executed on the entire ABC dataset of 1470 cases instead of the train dataset of 1028 records.

I am not able to understand what the issue is.

Upvotes: 0

Views: 165

Answers (1)

Hong Ooi

Reputation: 57696

When you do this:

Dept_1 <- as.factor(ABC$Dept)
Education_1 <- as.factor(ABC$Education)
BusinessTravel_1 <- as.factor(ABC$BusinessTravel)

you're actually creating three new variables in your global environment, not in your original data frame ABC. Because of this, when you split ABC into training and test samples, the new variables won't be affected.

When you go to fit the model, your glm call

fit = glm(attrition~Dept_1+Education_1+BusinessTravel_1,binomial(link="logit"),train)

will look for the variables listed in the formula, first in the data argument (train) and then in the environment where the formula was created. It won't find them in the train dataset, but it will find them in the global environment. That's why they have the original length.
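A minimal sketch of that lookup, using a made-up stand-in for ABC (the sizes, values, and the row-based split are assumptions for illustration, not the asker's data). It uses model.frame, which is the function glm relies on to collect the formula's variables:

```r
# Synthetic stand-in for ABC; all values are invented for illustration.
set.seed(1)
ABC <- data.frame(
  id   = 1:100,
  Dept = sample(c("HR", "Sales", "Research"), 100, replace = TRUE)
)

# Free-standing vector, created the same way as in the question.
Dept_1 <- as.factor(ABC$Dept)

# Hypothetical split: the first 70 rows act as the training set.
train <- ABC[1:70, ]

# Dept_1 is not a column of train, so it is taken from the
# enclosing environment, where it still has all 100 values.
mf <- model.frame(~ Dept_1, data = train)

"Dept_1" %in% names(train)  # FALSE
nrow(train)                 # 70
nrow(mf)                    # 100
```

The model frame ends up with the length of the free-standing vector, not the number of rows in train, which matches the behaviour described in the question.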

What you probably wanted is

ABC$Dept_1 <- as.factor(ABC$Dept)
ABC$Education_1 <- as.factor(ABC$Education)
ABC$BusinessTravel_1 <- as.factor(ABC$BusinessTravel)

which will create the variables in the data frame ABC. Do this before splitting, so that train and test both contain the new columns.
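Put together, the corrected workflow might look like the sketch below. The data are synthetic, and the sample size, level names, and 70/30 split are assumptions, since the question doesn't show how the split was done:

```r
# Synthetic stand-in for ABC; column names follow the question,
# the values and sizes are invented for illustration.
set.seed(42)
n <- 200
ABC <- data.frame(
  attrition      = rbinom(n, 1, 0.3),
  Dept           = sample(c("HR", "Sales", "Research"), n, replace = TRUE),
  Education      = sample(1:4, n, replace = TRUE),
  BusinessTravel = sample(c("Rarely", "Frequently", "None"), n, replace = TRUE)
)

# Create the factor columns inside the data frame, before splitting.
ABC$Dept_1           <- as.factor(ABC$Dept)
ABC$Education_1      <- as.factor(ABC$Education)
ABC$BusinessTravel_1 <- as.factor(ABC$BusinessTravel)

# Hypothetical split: 140 random rows for training, the rest for testing.
train_idx <- sample(n, size = 140)
train <- ABC[train_idx, ]
test  <- ABC[-train_idx, ]

# Every variable in the formula is now a column of train.
fit <- glm(attrition ~ Dept_1 + Education_1 + BusinessTravel_1,
           family = binomial(link = "logit"), data = train)

# Number of observations used in the fit; should equal nrow(train).
nobs(fit)
nrow(train)
```

Comparing nobs(fit) with nrow(train) is a quick way to confirm that the model was fitted on the training rows only.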

Upvotes: 3
