Reputation: 1758
I'm using randomForest in R.
I train upon a set of data which includes a factor variable. This variable has the following levels:
[1] "Economics" "Engineering" "Medicine"
[4] "Accounting" "Biology" "Computer Science"
[7] "Physics" "Law" "Chemistry"
My evaluation set has a subset of those levels:
[1] "Law" "Medicine"
The randomForest package requires the levels to be the same, so I have tried:
levels(evaluationSet$course) <- levels(trainingSet$course)
But then when I examine the rows in my evaluation set, the value has changed:
evaluationSet[1:3,c('course')]
# Gives "[1] Economics Engineering Economics", should give "[1] Law Medicine Law"
I'm new to R but I think what's happening here is that factors are an enumerated set. In the evaluation set, "Law" and "Medicine" are represented numerically in the factor (1 and 2 respectively). When I apply new levels, it's changing the values those indices map to.
I found a few similar topics on SO and tried their suggestions, but no luck:
evaluationSet <- droplevels(evaluationSet)
levels(evaluationSet$course) <- levels(trainingSet$course)
evaluationSet$course <- factor(evaluationSet$course)
How do I set the levels to be the same as the training set whilst preserving the values of my data?
EDIT: Adding results of head(evaluationSet) both before and after levels(evaluationSet$course) <- levels(trainingSet$course):
timestamp score age takenBefore course
1 1374910975 0.87 18 0 law
2 1374910975 0.81 21 0 medicine
3 1374910975 0.88 21 0 law
4 1374910975 0.88 21 0 law
5 1374910975 0.74 22 0 law
6 1374910975 0.76 23 1 medicine
timestamp score age takenBefore course
1 1374910975 0.87 18 0 economics
2 1374910975 0.81 21 0 engineering
3 1374910975 0.88 21 0 economics
4 1374910975 0.88 21 0 economics
5 1374910975 0.74 22 0 economics
6 1374910975 0.76 23 1 engineering
Upvotes: 2
Views: 1490
Reputation: 173587
Your intuition is basically correct. The crux of the issue is that the order of the levels matters. They aren't a set, so much as a mapping.
Here's an example:
> f <- factor(sample(letters[4:6],20,replace = TRUE))
> f
[1] d e e d e e f d d f e e d d e e f e d d
Levels: d e f
> levels(f)
[1] "d" "e" "f"
> levels(f) <- letters[1:6]
> f
[1] a b b a b b c a a c b b a a b b c b a a
Levels: a b c d e f
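Note that assigning the longer vector did not append levels to the end. It relabelled the existing three levels by position, so d, e and f became a, b and c. If you instead list the existing levels first, in their current order, the values are preserved: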
> f <- factor(sample(letters[4:6],20,replace = TRUE))
> f
[1] d f f e e d d f d d f d d e e e e f d e
Levels: d e f
> levels(f) <- c(letters[4:6],letters[1:3])
> f
[1] d f f e e d d f d d f d d e e e e f d e
Levels: d e f a b c
So you just need to respect the current ordering of levels in your evaluation set.
One way to think about this is that a factor is really just a vector of integers plus a lookup table of labels. Wherever R stores a 1, it displays the first level. Because factor() sorts the levels alphabetically by default, assigning a new levels vector in a different order changes which label each integer maps to.
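To make the integer mapping concrete, here is a minimal sketch using made-up values (the vector contents and the names f, g and h are hypothetical, not taken from the asker's data). It shows the underlying codes, then contrasts a positional relabelling with appending new levels after the existing ones:

```r
# Hypothetical data, for illustration only
f <- factor(c("law", "medicine", "law"))
levels(f)       # levels are sorted alphabetically by default
as.integer(f)   # the underlying integer codes

# Assigning a new levels vector relabels by position:
# the codes are untouched, only the labels they point to change
g <- f
levels(g) <- c("economics", "engineering", "law", "medicine")
as.character(g)

# Keeping the existing levels first and appending the new ones
# leaves every value as it was
h <- f
levels(h) <- c(levels(f), "economics", "engineering")
as.character(h)
```

In this sketch g ends up reading "economics", "engineering", "economics", which mirrors the symptom in the question, while h still reads "law", "medicine", "law".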
Upvotes: 3
Reputation: 60090
If you explicitly set the levels within factor(), you should have better luck:
eval = read.table(text=" timestamp score age takenBefore course
1 1374910975 0.87 18 0 law
2 1374910975 0.81 21 0 medicine
3 1374910975 0.88 21 0 law
4 1374910975 0.88 21 0 law
5 1374910975 0.74 22 0 law
6 1374910975 0.76 23 1 medicine", header=TRUE)
eval$course = factor(eval$course, levels=c("economics", "engineering", "medicine", "law"))
Result:
> eval$course
[1] law medicine law law law medicine
Levels: economics engineering medicine law
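The same idea can probably be applied without typing the levels by hand, by passing the training set's levels straight to factor(). The sketch below uses small hypothetical stand-ins for trainingSet and evaluationSet (the course values are invented), so treat it as an illustration rather than a drop-in fix:

```r
# Hypothetical stand-ins for the asker's data frames
trainingSet <- data.frame(
  course = factor(c("economics", "engineering", "medicine", "law", "biology"))
)
evaluationSet <- data.frame(
  course = factor(c("law", "medicine", "law"))
)

# Rebuild the factor by label (not by integer code),
# using the training set's levels and their order
evaluationSet$course <- factor(evaluationSet$course,
                               levels = levels(trainingSet$course))

as.character(evaluationSet$course)   # values are unchanged
identical(levels(evaluationSet$course),
          levels(trainingSet$course))  # levels now match
```

Because factor() matches on the labels, any evaluation value that does not appear in the training levels (including a difference in capitalisation) becomes NA, so it is worth checking for NAs afterwards.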
Upvotes: 2