Danny Morris

Reputation: 102

%in% vs == for subsetting

I am getting some unexpected behavior when using %in% c() versus == c() to filter data on multiple conditions. I get incomplete results when using the == c() method. Is there a logical explanation for this behavior?

df <- data.frame(region = as.factor(c(1,1,1,2,2,3,3,4,4,4)),
             value = 1:10)

library(dplyr)   
filter(df, region == c(1,2))
filter(df, region %in% c(1,2))

# using base syntax
df[df$region == c(1,2),]
df[df$region %in% c(1,2),]

The results do not change if I convert 'region' to numeric.

Upvotes: 0

Views: 93

Answers (1)

Tensibai

Reputation: 15784

I get incomplete results when using the == c() method. Is there a logical explanation for this behavior?

That's kind of logical, let's see:

df$region == 1:2
# [1]  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
df$region %in% 1:2
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

The reason is that in the first form you are comparing vectors of different lengths. As @lukeA said in his comment, R recycles the shorter vector, so this form is the same as the following (see implementation-of-standard-recycling-rules):

# 1 1 1 2 2 3 3 4 4 4  ## df$region
# 1 2 1 2 1 2 1 2 1 2  ## c(1,2) recycled to the same length
# T F T T F F F F F F  ## equality of the corresponding elements

df$region == c(1,2,1,2,1,2,1,2,1,2)
# [1]  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

Here each value on the left-hand side of the operator is compared only with the corresponding value on the right-hand side.
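To make the recycling explicit, here is a minimal sketch that builds the recycled vector by hand with rep_len() and checks that it gives the same result as the short form. I'm using a plain numeric vector named region instead of the factor column (an assumption for brevity; the question notes the results are the same either way), and the names recycled, manual and implicit are just for illustration.

```r
# Plain numeric stand-in for df$region (assumption: factor vs numeric
# makes no difference here, as noted in the question)
region <- c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4)

# Repeat c(1, 2) until it is as long as region, which is what == does implicitly
recycled <- rep_len(c(1, 2), length(region))

manual   <- region == recycled   # element-by-element comparison, spelled out
implicit <- region == c(1, 2)    # same comparison, recycling done by R

identical(manual, implicit)
# [1] TRUE
```

So with == a row is kept only when its region happens to line up with the matching position in the recycled vector, which is why some rows go missing.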

However, when you use df$region %in% 1:2, it works more like this:

sapply(df$region, function(x) { any(x==1:2) })
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

That is, each value is tested against the whole second vector, and TRUE is returned if there is at least one match.
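As a rough sketch of the same idea without sapply(): %in% is built on match(), and for a small set of values you can also spell the test out with |. The data frame below uses a numeric region column (an assumption to keep it short), and the names via_in, via_match and via_or are just for illustration.

```r
# Numeric version of the data frame from the question (assumption for brevity)
df <- data.frame(region = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4),
                 value = 1:10)

# Set membership: TRUE wherever region is any of the listed values
via_in <- df$region %in% c(1, 2)

# match() returns the position of the first match, or 0 here when there is none
via_match <- match(df$region, c(1, 2), nomatch = 0L) > 0L

# Explicit alternative: one == per value, combined with |
via_or <- df$region == 1 | df$region == 2

identical(via_in, via_match)
# [1] TRUE
identical(via_in, via_or)
# [1] TRUE

# All five rows with region 1 or 2 are kept
df[via_in, ]
```

So for "region is one of these values", %in% (or the | form) is what you want; == with a vector on the right is only meaningful when you really intend an element-by-element comparison.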

Upvotes: 4
