Reputation: 169
I have a dataset that I split into a train set (80%) and a test set (20%). The first step is to fit a decision tree on the train set, and then I predict on the test set.
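library(rpart)
# Use the bare column name so that "." does not also pull number in as a predictor
tree <- rpart(number ~ ., data = train, method = "class")
pred <- predict(tree, test, type = "class")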
After running this, I get an error:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = attr(object, : Faktor 'orderland' hat neue Stufen Zypern
In English: factor 'orderland' has new levels Zypern. This basically means that the country "Zypern" (Cyprus) appears in my test set but not in my train set. After some googling, I tried to fix this by giving both sets the same factor levels:
train$orderland <- factor(train$orderland, levels=levels(test$orderland))
Summary of test and train data:
> summary(train)
    number           orderland     lenkung            transmission       IntervalRange
 Length:54616       NA's:54616    Length:54616       Length:54616       1: 7893
 Class :character                 Class :character   Class :character   2:39528
 Mode  :character                 Mode  :character   Mode  :character   3: 7195
> summary(test)
    number           orderland          lenkung            transmission       IntervalRange
 Length:13655       Length:13655       Length:13655       Length:13655       1:1959
 Class :character   Class :character   Class :character   Class :character   2:9904
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   3:1792
But I still get the same error. Any ideas why?
Upvotes: 2
Views: 426
Reputation: 1784
I think you need to force both the train and the test set to contain every possible value of each categorical variable. I'm not sure how your dataset is structured, so I'll assume lenkung is your country variable.
One way to go about it would be:
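# Randomly split one data frame x into a train part and a test part
train_test = function(x, train_per = 0.7){
  # keep at least one row for training whenever the group is non-empty
  smp_size = min(nrow(x), max(1, floor(train_per * nrow(x))))
  train_ind = sample(seq_len(nrow(x)), size = smp_size)
  re = list()
  re$train = x[train_ind, ]
  # setdiff() rather than x[-train_ind, ], which returns no rows if train_ind is empty
  re$test = x[setdiff(seq_len(nrow(x)), train_ind), ]
  return(re)
}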
splitted_data = split(data, data$lenkung)
new_list = lapply(splitted_data, train_test)
Here we defined a function that splits a data frame (x) into a train and a test set. We also use the split() function to split your original data into several data frames, each containing only one of the possible values of lenkung. Let's say the values could be "A", "B" or "C". In that case splitted_data would be a list of 3 data frames: the first containing all observations where lenkung = "A", the second all observations where lenkung = "B", and so on.
Then we apply the function defined above to each element of splitted_data. Now new_list contains 2 data frames for each possible value of lenkung: a train one and a test one.
So finally we just need to bind the rows of each train dataframe together and do the same for the test dataframes.
# Start with the train/test pair of the first group
train = new_list[[1]][[1]]
test = new_list[[1]][[2]]
# Then bind the pairs of the remaining groups onto them
for(i in seq_along(new_list)[-1]){
  train = rbind(train, new_list[[i]][[1]])
  test = rbind(test, new_list[[i]][[2]])
}
new_list is a list of lists, each holding 2 data frames. So we use new_list[[1]] to access the 2 data frames corresponding to the first value of lenkung, and new_list[[1]][[1]] to access the first of those two (the train one).
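To see the whole procedure run end to end, here is a minimal sketch on a small made-up data frame. The names toy, split_one and parts are hypothetical and only used for illustration, the group sizes are invented, and the "at least one train row per group" rule is my own assumption rather than something required by your data. It also binds the pieces with do.call(rbind, ...) instead of a loop.

```r
set.seed(1)  # make the random split reproducible

# Hypothetical toy data standing in for the real data set
toy <- data.frame(
  lenkung = rep(c("A", "B", "C"), times = c(10, 6, 4)),
  value   = seq_len(20),
  stringsAsFactors = FALSE
)

# Split one group into train/test by sampling row indices
# (assumption: every non-empty group contributes at least one train row)
split_one <- function(x, train_per = 0.7) {
  n_train <- min(nrow(x), max(1, floor(train_per * nrow(x))))
  idx <- sample(seq_len(nrow(x)), size = n_train)
  list(
    train = x[idx, , drop = FALSE],
    test  = x[setdiff(seq_len(nrow(x)), idx), , drop = FALSE]
  )
}

# One train/test pair per value of lenkung
parts <- lapply(split(toy, toy$lenkung), split_one)

# Bind all train pieces together, and likewise all test pieces
train <- do.call(rbind, lapply(parts, `[[`, "train"))
test  <- do.call(rbind, lapply(parts, `[[`, "test"))

# Every value that occurs in test should also occur in train
all(unique(test$lenkung) %in% unique(train$lenkung))
```

Because the sampling happens inside each group, a value can only end up in the test set if at least one row with that value went into the train set first.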
There probably is a better way to do this though.
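One candidate for a simpler route is to keep your original split and only give the factor the same levels in both sets. Note that summary(test) in the question shows orderland as character, and levels() of a character vector is NULL, which would explain why your attempt turned the whole train column into NA (NA's:54616). The sketch below uses small made-up train and test frames (hypothetical data, same column name as yours) and builds the levels from the union of both sets.

```r
# Hypothetical stand-ins for the question's train/test frames
train <- data.frame(orderland = c("Deutschland", "Frankreich", "Deutschland"),
                    stringsAsFactors = FALSE)
test  <- data.frame(orderland = c("Zypern", "Deutschland"),
                    stringsAsFactors = FALSE)

# What the attempt in the question does: test$orderland is character,
# so levels(test$orderland) is NULL and every value becomes NA
broken <- factor(train$orderland, levels = levels(test$orderland))

# Build one shared set of levels from both frames and apply it to both
all_levels <- union(unique(train$orderland), unique(test$orderland))
train$orderland <- factor(train$orderland, levels = all_levels)
test$orderland  <- factor(test$orderland,  levels = all_levels)
```

After this, both columns are factors with identical levels, so I would expect refitting the tree and calling predict() as in the question to get past the "new levels" check, although I have not tried it on your data. The tree still never sees a "Zypern" row during training, so its predictions for that country are not backed by any training example.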
Upvotes: 1