Levels in R - Setting Correctly Against New Data Sets

Question

I'm using randomForest in R.

I train upon a set of data which includes a factor variable. This variable has the following levels:

[1] "Economics"    "Engineering"   "Medicine"
[4] "Accounting"   "Biology"       "Computer Science"
[7] "Physics"      "Law"           "Chemistry"

My evaluation set has a subset of those levels:

[1] "Law"          "Medicine"

The randomForest package requires the levels to be the same, so I have tried:

levels(evaluationSet$course) <- levels(trainingSet$course)

But then when I examine the rows in my evaluation set, the value has changed:

evaluationSet[1:3,c('course')]
# Gives "[1] Economics Engineering Economics", should give "[1] Law Medicine Law"

I'm new to R but I think what's happening here is that factors are an enumerated set. In the evaluation set, "Law" and "Medicine" are represented numerically in the factor (1 and 2 respectively). When I apply new levels, it's changing the values those indices map to.
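
A minimal sketch that seems to confirm this guess. The `course` vector is a made-up stand-in for `evaluationSet$course`, and the four lowercase level names are assumed for illustration only:

```r
# Hypothetical stand-in for evaluationSet$course
course <- factor(c("law", "medicine", "law"))

levels(course)      # "law" "medicine"
as.integer(course)  # 1 2 1 (the underlying integer codes)

# Assigning to levels() relabels the codes by position; the codes are untouched
relabelled <- course
levels(relabelled) <- c("economics", "engineering", "medicine", "law")

as.integer(relabelled)    # still 1 2 1
as.character(relabelled)  # "economics" "engineering" "economics"
```

So the stored codes never change; only the labels they point to do, which matches the behaviour described above.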

I found a few similar topics on SO and tried their suggestions, but no luck:

evaluationSet <- droplevels(evaluationSet)
levels(evaluationSet$course) <- levels(trainingSet$course)
evaluationSet$course <- factor(evaluationSet$course)

How do I set the levels to be the same as the training set whilst preserving the values of my data?

EDIT: Adding results of head(evaluationSet) both before and after levels(evaluationSet$course) <- levels(trainingSet$course):

   timestamp score age takenBefore   course
1 1374910975  0.87  18           0      law
2 1374910975  0.81  21           0 medicine
3 1374910975  0.88  21           0      law
4 1374910975  0.88  21           0      law
5 1374910975  0.74  22           0      law
6 1374910975  0.76  23           1 medicine

   timestamp score age takenBefore      course
1 1374910975  0.87  18           0   economics
2 1374910975  0.81  21           0 engineering
3 1374910975  0.88  21           0   economics
4 1374910975  0.88  21           0   economics
5 1374910975  0.74  22           0   economics
6 1374910975  0.76  23           1 engineering

Marius · Accepted Answer

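Set the levels explicitly in the call to factor() instead of assigning to levels() afterwards. factor() matches each value against the supplied levels by label, so the existing values are preserved, whereas assigning to levels() only renames the current levels in position order: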

# Recreate the sample rows from the question
eval <- read.table(text="   timestamp score age takenBefore   course
1 1374910975  0.87  18           0      law
2 1374910975  0.81  21           0 medicine
3 1374910975  0.88  21           0      law
4 1374910975  0.88  21           0      law
5 1374910975  0.74  22           0      law
6 1374910975  0.76  23           1 medicine", header=TRUE)
# Rebuild the factor with the full level set; values are matched by label
eval$course <- factor(eval$course, levels=c("economics", "engineering", "medicine", "law"))

Result:

> eval$course
[1] law      medicine law      law      law      medicine
Levels: economics engineering medicine law
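
The same idea can probably be applied with the training set's levels directly instead of a hand-typed vector. This is only a sketch: the two data frames below are small hypothetical stand-ins for the question's `trainingSet` and `evaluationSet`, and the `NA` check is an added precaution rather than part of the original answer.

```r
# Hypothetical stand-ins for the question's trainingSet and evaluationSet
trainingSet <- data.frame(
  course = factor(c("economics", "engineering", "medicine", "law"))
)
evaluationSet <- data.frame(
  course = c("law", "medicine", "law"),
  stringsAsFactors = TRUE
)

# Rebuild the factor by label, reusing the training levels
evaluationSet$course <- factor(
  as.character(evaluationSet$course),
  levels = levels(trainingSet$course)
)

# Values absent from the training levels would become NA, so check for them
stopifnot(!anyNA(evaluationSet$course))

as.character(evaluationSet$course)  # "law" "medicine" "law"
identical(levels(evaluationSet$course), levels(trainingSet$course))  # TRUE
```

Going through as.character() is not strictly required, but it makes it explicit that matching happens on the labels and not on the integer codes.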
