pineapple

Reputation: 169

Error when my test set contains factor levels that my training set doesn't have

I have a dataset that I split into a training set (80%) and a test set (20%). I first fit a decision tree on the training set and then predict on the test set.

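library(rpart)
# Use the bare column name on the left-hand side; with train$number ~ . the dot would also pull in number as a predictor
tree <- rpart(number ~ ., data = train, method = "class")
pred <- predict(tree, test, type = "class")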

After running this, I get an error:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = attr(object, : Faktor 'orderland' hat neue Stufen Zypern

The message translates to "factor 'orderland' has new levels Zypern", which means the country "Zypern" (Cyprus) appears in my test set but not in my training set. After some searching, I tried to fix this by giving both sets the same factor levels:

train$orderland <- factor(train$orderland, levels=levels(test$orderland))

Summary of test and train data:

> summary(train)
 number             orderland      lenkung          transmission IntervalRange
 Length:54616       NA's:54616   Length:54616       Length:54616       1: 7893      
 Class :character                Class :character   Class :character   2:39528      
 Mode  :character                Mode  :character   Mode  :character   3: 7195 

> summary(test)
 number              orderland           lenkung          transmission IntervalRange
 Length:13655       Length:13655       Length:13655       Length:13655       1:1959       
 Class :character   Class :character   Class :character   Class :character   2:9904       
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   3:1792

But I get the same error...any ideas why?

Upvotes: 2

Views: 426

Answers (1)

Fino

Reputation: 1784

I think you need to make sure that both the train and test sets contain every possible value of each categorical variable. I'm not sure how your dataset is structured, so I'll assume lenkung is your land variable (use orderland instead if that is the column named in the error).

One way to go about it would be:

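# Split one data frame into a train and a test part
train_test = function(x, train_per = 0.7){
  # keep at least one row for training, even in very small groups
  smp_size = max(1, floor(train_per * nrow(x)))

  train_ind = sample(seq_len(nrow(x)), size = smp_size)

  re = list()
  re$train = x[train_ind, ]
  re$test = x[-train_ind, ]
  return(re)
}
# data is the full, unsplit data frame; drop = TRUE skips empty groups
splitted_data = split(data, data$lenkung, drop = TRUE)
new_list = lapply(splitted_data, train_test)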

Here we define a function that splits a data frame (x) into a train and a test set. We also use the split() function to split your original data into several data frames, each containing only one of the possible values of lenkung. Say the values could be "A", "B" or "C". In that case splitted_data would be a list of 3 data frames: the first containing all observations where lenkung = "A", the second all observations where lenkung = "B", and so on.

Then we apply the function defined above to each element of splitted_data. Now new_list contains 2 data frames for each possible value of lenkung: a train one and a test one.

So finally we just need to bind the rows of each train dataframe together and do the same for the test dataframes.

train = new_list[[1]][[1]]
test = new_list[[1]][[2]]
# Bind the remaining groups onto the first one
for(i in seq_along(new_list)[-1]){
  train = rbind(train, new_list[[i]][[1]])
  test = rbind(test, new_list[[i]][[2]])
}

new_list is a list of lists, each holding 2 data frames. So we use new_list[[1]] to access the 2 data frames corresponding to the first value of lenkung, and new_list[[1]][[1]] to access the first data frame (the train one) there.
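To check that a per-group split behaves as described, here is a minimal sketch on a small made-up data frame. The names toy, split_group, parts, toy_train and toy_test are hypothetical and only used for illustration, and the binding step uses do.call(rbind, ...) instead of a loop. It assumes you always want at least one row of each group in the training part.

```r
# Toy data, made up for illustration; lenkung stands in for the categorical column.
set.seed(1)
toy <- data.frame(
  lenkung = rep(c("A", "B", "C"), times = c(10, 5, 2)),
  value   = seq_len(17)
)

# Split one group, always keeping at least one row for training.
split_group <- function(x, train_per = 0.7) {
  n_train <- max(1, floor(train_per * nrow(x)))
  idx <- sample(seq_len(nrow(x)), size = n_train)
  list(train = x[idx, , drop = FALSE],
       test  = x[setdiff(seq_len(nrow(x)), idx), , drop = FALSE])
}

parts <- lapply(split(toy, toy$lenkung), split_group)

# Bind the per-group pieces back together in one call each.
toy_train <- do.call(rbind, lapply(parts, `[[`, "train"))
toy_test  <- do.call(rbind, lapply(parts, `[[`, "test"))

# Every value seen in the test part should also be present in the training part.
all(unique(toy_test$lenkung) %in% unique(toy_train$lenkung))
```

If the last line returns TRUE, no value of the grouping column is exclusive to the test part, which is the condition the prediction step needs.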

There probably is a better way to do this though.
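A possible alternative, if you want to keep your original random split: the summary in the question shows orderland stored as character in the test set, and levels() returns NULL for a character vector, so levels(test$orderland) has no levels to pass on. The sketch below, on made-up data, builds one level set from both data frames and applies it to each. The names train_toy, test_toy and all_levels are hypothetical. This only addresses the level mismatch; how well the tree predicts for a country it never saw during training is a separate question.

```r
# Toy stand-ins for the question's data; the values are made up.
train_toy <- data.frame(orderland = c("Deutschland", "Italien", "Deutschland"),
                        stringsAsFactors = FALSE)
test_toy  <- data.frame(orderland = c("Italien", "Zypern"),
                        stringsAsFactors = FALSE)

# The columns are character, so levels() has nothing to return.
is.null(levels(test_toy$orderland))

# Build one level set from both data frames and apply it to each.
all_levels <- sort(unique(c(train_toy$orderland, test_toy$orderland)))
train_toy$orderland <- factor(train_toy$orderland, levels = all_levels)
test_toy$orderland  <- factor(test_toy$orderland,  levels = all_levels)

# Both factors now share the same levels, including "Zypern".
identical(levels(train_toy$orderland), levels(test_toy$orderland))
```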

Upvotes: 1
