Cam Price-Austin

Reputation: 1758

Levels in R - Setting Correctly Against New Data Sets

I'm using randomForest in R.

I train upon a set of data which includes a factor variable. This variable has the following levels:

[1] "Economics"    "Engineering"   "Medicine"
[4] "Accounting"   "Biology"       "Computer Science"
[7] "Physics"      "Law"           "Chemistry"

My evaluation set has a subset of those levels:

[1] "Law"          "Medicine"

The randomForest package requires the levels to be the same, so I have tried:

levels(evaluationSet$course) <- levels(trainingSet$course)

But then when I examine the rows in my evaluation set, the value has changed:

evaluationSet[1:3,c('course')]
# Gives "[1] Economics Engineering Economics", should give "[1] Law Medicine Law"

I'm new to R but I think what's happening here is that factors are an enumerated set. In the evaluation set, "Law" and "Medicine" are represented numerically in the factor (1 and 2 respectively). When I apply new levels, it's changing the values those indices map to.

I found a few similar topics on SO and tried their suggestions, but no luck:

evaluationSet <- droplevels(evaluationSet)
levels(evaluationSet$course) <- levels(trainingSet$course)
evaluationSet$course <- factor(evaluationSet$course)

How do I set the levels to be the same as the training set whilst preserving the values of my data?

EDIT: Adding results of head(evaluationSet) both before and after levels(evaluationSet$course) <- levels(trainingSet$course):

   timestamp score age takenBefore   course
1 1374910975  0.87  18           0      law
2 1374910975  0.81  21           0 medicine
3 1374910975  0.88  21           0      law
4 1374910975  0.88  21           0      law
5 1374910975  0.74  22           0      law
6 1374910975  0.76  23           1 medicine

   timestamp score age takenBefore      course
1 1374910975  0.87  18           0   economics
2 1374910975  0.81  21           0 engineering
3 1374910975  0.88  21           0   economics
4 1374910975  0.88  21           0   economics
5 1374910975  0.74  22           0   economics
6 1374910975  0.76  23           1 engineering

Upvotes: 2

Views: 1490

Answers (2)

joran

Reputation: 173587

Your intuition is basically correct. The crux of the issue is that the order of the levels matters: they are less a set than a mapping from integer codes to labels.

Here's an example:

> f <- factor(sample(letters[4:6],20,replace = TRUE))
> f
 [1] d e e d e e f d d f e e d d e e f e d d
Levels: d e f
> levels(f)
[1] "d" "e" "f"
> levels(f) <- letters[1:6]
> f
 [1] a b b a b b c a a c b b a a b b c b a a
Levels: a b c d e f

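Note that assigning the new levels relabeled the existing ones by position: d, e and f became a, b and c. If you instead keep the existing levels first and append the new ones, the values are preserved: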

> f <- factor(sample(letters[4:6],20,replace = TRUE))
> f
 [1] d f f e e d d f d d f d d e e e e f d e
Levels: d e f
> levels(f) <- c(letters[4:6],letters[1:3])
> f
 [1] d f f e e d d f d d f d d e e e e f d e
Levels: d e f a b c

So you just need to respect the current ordering of levels in your evaluation set.
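
As a rough sketch of that idea (train_levels and the sample values here are made up for illustration, not taken from your data), union() keeps the existing levels in their current positions and appends only the missing ones:

```r
# Hypothetical data: the evaluation factor holds only two of the training levels.
train_levels <- c("economics", "engineering", "medicine", "law")
f <- factor(c("law", "medicine", "law"))

levels(f)        # "law" "medicine"

# union() keeps the existing levels first, in their current order,
# then appends the ones that are missing.
levels(f) <- union(levels(f), train_levels)

as.character(f)  # values unchanged: "law" "medicine" "law"
levels(f)        # "law" "medicine" "economics" "engineering"
```

The resulting level order differs from train_levels, so if the order has to match as well, rebuilding the factor with factor(..., levels = train_levels) is probably the safer route.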

One way to think about this is that a factor is really just a vector of integers plus a vector of labels. Wherever R stores a 1, the value shown is the first level. Since factor() sorts the levels alphabetically by default, assigning a new set of levels can silently change that mapping.
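
To make the integer view concrete, here is a small sketch with made-up values. It shows that assigning to levels() leaves the underlying codes alone and only swaps the labels they point to:

```r
f <- factor(c("law", "medicine", "law"))

as.integer(f)    # underlying codes: 1 2 1
levels(f)        # labels for codes 1 and 2: "law" "medicine"

# Replacing the labels leaves the codes untouched, so the values appear to change.
g <- f
levels(g) <- c("economics", "engineering", "medicine", "law")

as.integer(g)    # still 1 2 1
as.character(g)  # "economics" "engineering" "economics"
```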

Upvotes: 3

Marius

Reputation: 60090

If you set the levels explicitly in the call to factor(), the existing values are matched by label rather than by position, so you should have better luck:

eval = read.table(text="   timestamp score age takenBefore   course
1 1374910975  0.87  18           0      law
2 1374910975  0.81  21           0 medicine
3 1374910975  0.88  21           0      law
4 1374910975  0.88  21           0      law
5 1374910975  0.74  22           0      law
6 1374910975  0.76  23           1 medicine", header=TRUE)
eval$course = factor(eval$course, levels=c("economics", "engineering", "medicine", "law"))

Result:

> eval$course
[1] law      medicine law      law      law      medicine
Levels: economics engineering medicine law
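
Rather than typing the levels by hand, you can probably take them straight from the training set. The two data frames below are hypothetical stand-ins for your trainingSet and evaluationSet, so treat this as a sketch rather than a drop-in fix:

```r
# Hypothetical stand-ins for the question's trainingSet and evaluationSet.
trainingSet <- data.frame(
  course = factor(c("economics", "engineering", "medicine", "law"))
)
evaluationSet <- data.frame(
  course = factor(c("law", "medicine", "law"))
)

# Rebuild the factor from its labels, using the training levels in training order.
evaluationSet$course <- factor(as.character(evaluationSet$course),
                               levels = levels(trainingSet$course))

as.character(evaluationSet$course)  # "law" "medicine" "law"
identical(levels(evaluationSet$course), levels(trainingSet$course))  # TRUE

# Any value that is not among the training levels becomes NA, so check for that.
anyNA(evaluationSet$course)  # FALSE
```

The labels have to match exactly, including case, otherwise the unmatched values end up as NA.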

Upvotes: 2
