Calga

Reputation: 43

Unexpected behaviour using sapply on a factor in R

Using R, I was recently surprised by the output of sapply when it is applied to a factor. Have a look at the following:

> F <- as.factor(c("A", "B", "C", "D", "E", "F"))

> sapply(F, function(x) x)
[1] A B C D E F
Levels: A B C D E F

> sapply(F, function(x) (x=="C"))
[1] FALSE FALSE  TRUE FALSE FALSE FALSE

So far so good, both outputs are as expected. But now, it is getting strange:

> sapply(F, function(x) if (TRUE) x else NA)
[1] A B C D E F
Levels: A B C D E F

> sapply(F, function(x) if (x=="C") x else NA)
[1] NA NA  3 NA NA NA

> sapply(F, function(x) {if (x=="C") foo <- "bar"; x})
[1] A B C D E F
Levels: A B C D E F

In these three cases, the first and the last results are as I would expect. The second one is strange: I would expect to get something like [1] NA NA "C" NA NA NA.

My first guess was that the comparison (x=="C") has some effect on the value of x inside the if-clause. (Not outside the clause, otherwise we would get a different result in the last case above.) Perhaps x is treated as its index inside the clause.

However this guess is not compatible with the following two observations:

> sapply(F, function(x) if (x==x) x else NA)
[1] A B C D E F
Levels: A B C D E F

> sapply(F, function(x) if (x=="C") F[x] else NA)
[1] NA NA  3 NA NA NA

Here, the (x==x) doesn't seem to have any influence at all, and if x were its index inside the clause, we would get back a "C" instead of a 3.

My actual question is: why does this happen? (By now I'm quite sure this is some factor-related feature I'm not aware of...)

Upvotes: 2

Views: 237

Answers (1)

Roland

Reputation: 132576

sapply is basically lapply followed by simplify2array, which in this case is simply a call to unlist.

First let's check if the behavior is caused by lapply:

lapply(F, function(x) if (x=="C") x else NA)
#[[1]]
#[1] NA
#
#[[2]]
#[1] NA
#
#[[3]]
#[1] C
#Levels: A B C D E F
#
#[[4]]
#[1] NA
#
#[[5]]
#[1] NA
#
#[[6]]
#[1] NA

As you can see, the third element is still a factor. However, the NA values are of class "logical":

class(lapply(F, function(x) if (x=="C") x else NA)[[1]])
#[1] "logical"

This means two quotes from help("unlist") are relevant:

Factors are treated specially. If all non-list elements of x are factors (or ordered factors) then the result will be a factor with levels the union of the level sets of the elements, in the order the levels occur in the level sets of the elements (which means that if all the elements have the same level set, that is the level set of the result).

and

Where possible the list elements are coerced to a common mode during the unlisting, and so the result often ends up as a character vector. Vectors will be coerced to the highest type of the components in the hierarchy NULL < raw < logical < integer < double < complex < character < list < expression: pairlists are treated as lists.

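The first quote does not apply here, because not all elements are factors: the NA values are logical. So the second quote describes what happens; the common mode of a factor (which internally is an integer vector with attributes) and a logical value is integer. And this is what you get: the internal code 3 of level "C", with the levels dropped.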
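The coercion can be reproduced without sapply. The following is only a small sketch of the reasoning above; the names f, mixed and pick_c are chosen for this illustration and are not from the question (which calls the factor F, masking R's shorthand for FALSE).

```r
# `f` stands in for the question's `F`.
f <- factor(c("A", "B", "C", "D", "E", "F"))

# A factor is stored as integer codes plus a "levels" attribute.
typeof(f)        # "integer"

# Mixing a logical NA with a factor: not every element is a factor,
# so unlist() falls back to ordinary coercion (logical < integer)
# and returns the bare integer code of "C".
mixed <- unlist(list(NA, f[3]))
mixed            # NA 3
typeof(mixed)    # "integer"

# For length-one results, sapply() should give the same thing as
# unlist(lapply()).
pick_c <- function(x) if (x == "C") x else NA
identical(sapply(f, pick_c), unlist(lapply(f, pick_c)))
```

This also suggests why if (x==x) x else NA returns a factor: the else branch is never taken, so every list element is a factor and the special case from the first quote applies.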

If you want to make sure that you get a factor vector from sapply, create a factor NA value in your else condition:

sapply(F, function(x) if (x=="C") x else {is.na(x) <- TRUE; x})
#[1] <NA> <NA> C    <NA> <NA> <NA>
#Levels: A B C D E F
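Two other ways to get the same result are sketched below, as possible alternatives rather than the recommended approach. The first returns a factor NA built with factor(NA, levels = ...), so that every list element is a factor; the second avoids sapply entirely by assigning NA through a logical index. The names f, res1 and res2 are made up for this illustration.

```r
# `f` stands in for the question's `F`.
f <- factor(c("A", "B", "C", "D", "E", "F"))

# Alternative 1: the else branch returns an NA that is itself a factor
# with the same levels, so unlist() sees only factors.
res1 <- sapply(f, function(x) {
  if (x == "C") x else factor(NA, levels = levels(f))
})

# Alternative 2: vectorised, no sapply needed. Copy the factor and
# set every element that is not "C" to NA.
res2 <- f
res2[res2 != "C"] <- NA

is.factor(res1)
is.factor(res2)
as.character(res2)
```

The vectorised form is usually the simpler choice when the condition can be written as a comparison on the whole factor.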

Upvotes: 3
