Reputation: 43
Using R, I recently got surprised by the output of sapply when used on a factor. Have a look at the following:
> F <- as.factor(c("A", "B", "C", "D", "E", "F"))
> sapply(F, function(x) x)
[1] A B C D E F
Levels: A B C D E F
> sapply(F, function(x) (x=="C"))
[1] FALSE FALSE TRUE FALSE FALSE FALSE
So far so good, both outputs are as expected. But now, it is getting strange:
> sapply(F, function(x) if (TRUE) x else NA)
[1] A B C D E F
Levels: A B C D E F
> sapply(F, function(x) if (x=="C") x else NA)
[1] NA NA 3 NA NA NA
> sapply(F, function(x) {if (x=="C") foo <- "bar"; x})
[1] A B C D E F
Levels: A B C D E F
In these three cases, the first and the last results are as I would expect. The second one is somehow strange: I would expect to get something like [1] NA NA "C" NA NA NA.
My first guess was that the comparison (x=="C") has some impact on the value of x inside the if-clause. (Not outside the clause; otherwise we would get a different result in the last case above.) Probably x is treated as its index inside the clause.
However this guess is not compatible with the following two observations:
> sapply(F, function(x) if (x==x) x else NA)
[1] A B C D E F
Levels: A B C D E F
> sapply(F, function(x) if (x=="C") F[x] else NA)
[1] NA NA 3 NA NA NA
Here, the (x==x) doesn't seem to have any influence at all, and if x were treated as its index inside the clause, we would get back a "C" instead of a 3.
My actual question is: why does this happen? (By now I'm quite sure this is some factor-related feature I'm not aware of...)
Upvotes: 2
Views: 237
Reputation: 132576
sapply is basically lapply followed by simplify2array, which in this case is simply a call to unlist.
First let's check if the behavior is caused by lapply:
lapply(F, function(x) if (x=="C") x else NA)
#[[1]]
#[1] NA
#
#[[2]]
#[1] NA
#
#[[3]]
#[1] C
#Levels: A B C D E F
#
#[[4]]
#[1] NA
#
#[[5]]
#[1] NA
#
#[[6]]
#[1] NA
As you see, the third element is still a factor. However, the NA values are of class "logical":
class(lapply(F, function(x) if (x=="C") x else NA)[[1]])
#[1] "logical"
This means two quotes from help("unlist") are relevant:
Factors are treated specially. If all non-list elements of x are factors (or ordered factors) then the result will be a factor with levels the union of the level sets of the elements, in the order the levels occur in the level sets of the elements (which means that if all the elements have the same level set, that is the level set of the result).
and
Where possible the list elements are coerced to a common mode during the unlisting, and so the result often ends up as a character vector. Vectors will be coerced to the highest type of the components in the hierarchy NULL < raw < logical < integer < double < complex < character < list < expression: pairlists are treated as lists.
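The second quote describes what happens here. Because the NA elements are logical, not all elements of the list are factors, so the special factor handling from the first quote does not apply. The common mode of a factor (which internally is an integer vector with attributes) and a logical value is integer, so unlist returns the factor's integer codes. And this is what you get.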
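The coercion can be checked on unlist directly. The following is a minimal sketch, assuming a small three-level factor named f (the name and the shortened level set are chosen here for illustration; f is also used instead of F so the FALSE shorthand is not masked):

```r
# Hypothetical small factor, standing in for the six-level one in the question.
f <- factor(c("A", "B", "C"))

# A mix of logical NA and a factor element: not all elements are factors,
# so unlist falls back to the common mode, which is integer.
codes <- unlist(list(NA, f[3], NA))
typeof(codes)   # "integer"
codes           # NA 3 NA

# When every element is a factor, the result stays a factor.
kept <- unlist(list(f[1], f[3]))
class(kept)     # "factor"

# For length-one results, sapply should agree with unlist(lapply(...)).
fun <- function(x) if (x == "C") x else NA
same <- identical(sapply(f, fun), unlist(lapply(f, fun)))
```

The value 3 is the position of "C" in levels(f), which is why the question's output showed 3 rather than "C".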
If you want to make sure that you get a factor vector from sapply, create a factor NA value in your else condition:
sapply(F, function(x) if (x=="C") x else {is.na(x) <- TRUE; x})
#[1] <NA> <NA> C <NA> <NA> <NA>
#Levels: A B C D E F
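If the goal is instead the character result the question expected (NA NA "C" NA NA NA), one option is to convert to character inside the function so that unlist never sees a factor. This is a sketch of that alternative, not part of the approach above; the variable names f and as_text are chosen here for illustration:

```r
# Same levels as in the question, named f to avoid masking the FALSE shorthand.
f <- factor(c("A", "B", "C", "D", "E", "F"))

# Return the label as a character string, and a character NA otherwise,
# so every element already has the same type before simplification.
as_text <- sapply(f, function(x) if (x == "C") as.character(x) else NA_character_)
as_text
```

Using NA_character_ rather than a plain NA makes the intended type explicit, although a logical NA would also be coerced to character here.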
Upvotes: 3