Reputation: 23
I have just started learning R and am working on a dataset that has 1470 cases. The dataset is named ABC. Using as.factor, I have converted the categorical variables to factors.
Dept_1 <- as.factor(ABC$Dept)
Education_1 <- as.factor(ABC$Education)
BusinessTravel_1 <- as.factor(ABC$BusinessTravel)
After that I split the dataset into train and test. The number of cases in both train and test looks correct. Then I call glm using the syntax below:
fit = glm(attrition~Dept_1+Education_1+BusinessTravel_1,binomial(link="logit"),train)
The model fits, but it runs on the entire dataset ABC with 1470 cases instead of the train dataset with 1028 records.
I am not able to understand what the issue is.
Upvotes: 0
Views: 165
Reputation: 57696
When you do this:
Dept_1 <- as.factor(ABC$Dept)
Education_1 <- as.factor(ABC$Education)
BusinessTravel_1 <- as.factor(ABC$BusinessTravel)
you're actually creating three new variables in your global environment, not in your original data frame ABC. Because of this, when you split ABC into training and test samples, the new variables won't be affected.
When you go to fit the model, your glm call
fit = glm(attrition~Dept_1+Education_1+BusinessTravel_1,binomial(link="logit"),train)
will look for the variables listed in the formula. It won't find them in the train dataset, but it will find them in the global environment. That's why they have the original length.
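To see this lookup behaviour in isolation, here is a minimal sketch on made-up data. The names toy, g_1, and sub are hypothetical stand-ins and not part of your code; model.frame is the function glm uses internally to collect the variables named in the formula.

```r
# Hypothetical 10-row data frame standing in for ABC
toy <- data.frame(
  y = rep(c(0, 1), 5),
  g = rep(c("a", "b", "c", "a", "b"), 2)
)

# This factor lives in the global environment, not inside toy
g_1 <- as.factor(toy$g)

# A 6-row subset, standing in for the train split
sub <- toy[1:6, ]

# g_1 is not a column of sub, so R falls back to the formula's
# environment (here the global environment) and picks up all 10 values
mf <- model.frame(~ g_1, data = sub)
nrow(mf)   # 10, not 6
```

The data argument is only searched first; anything not found there is taken from the environment where the formula was created, silently and at its full length.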
What you probably wanted is
ABC$Dept_1 <- as.factor(ABC$Dept)
ABC$Education_1 <- as.factor(ABC$Education)
ABC$BusinessTravel_1 <- as.factor(ABC$BusinessTravel)
which will create the variables in the data frame ABC.
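As a rough end-to-end sketch of that workflow, assuming simulated data and a simple random 70/30 split (ABC_toy, idx, and the split scheme are assumptions for illustration, since your splitting code isn't shown): the factor column is added to the data frame before splitting, so it travels with the rows into train and test.

```r
set.seed(42)

# Hypothetical stand-in for ABC with 200 simulated rows
ABC_toy <- data.frame(
  attrition = rbinom(200, 1, 0.3),
  Dept = sample(c("HR", "Sales", "R&D"), 200, replace = TRUE)
)

# Create the factor as a column of the data frame, before splitting
ABC_toy$Dept_1 <- as.factor(ABC_toy$Dept)

# Assumed split: 140 random rows for training, the remaining 60 for testing
idx <- sample(nrow(ABC_toy), size = 140)
train <- ABC_toy[idx, ]
test <- ABC_toy[-idx, ]

# Dept_1 is now found inside train, so only the training rows are used
fit <- glm(attrition ~ Dept_1, family = binomial(link = "logit"), data = train)
nobs(fit)   # 140
```

If the old global copies (Dept_1 and so on) are still in your workspace, removing them with rm() makes a mistyped column name fail with an error instead of silently falling back to the full-length vectors.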
Upvotes: 3