Reputation: 102
I am getting some unexpected behavior using %in% c() versus == c() to filter data on multiple conditions. I get incomplete results when using the == c() method. Is there a logical explanation for this behavior?
df <- data.frame(region = as.factor(c(1,1,1,2,2,3,3,4,4,4)),
value = 1:10)
library(dplyr)
filter(df, region == c(1,2))
filter(df, region %in% c(1,2))
# using base syntax
df[df$region == c(1,2),]
df[df$region %in% c(1,2),]
The results do not change if I convert 'region' to numeric.
Upvotes: 0
Views: 93
Reputation: 15784
I get incomplete results when using the == c() method. Is there a logical explanation for this behavior?
That's kind of logical, let's see:
df$region == 1:2
# [1] TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
df$region %in% 1:2
# [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
The reason is that in the first form you're comparing vectors of different lengths. As @lukeA said in his comment, R recycles the shorter vector to the length of the longer one, so this form is the same as the following (see implementation-of-standard-recycling-rules):
# 1 1 1 2 2 3 3 4 4 4 ## df$region
# 1 2 1 2 1 2 1 2 1 2 ## c(1,2) recycled to the same length
# T F T T F F F F F F ## equality of the corresponding elements
df$region == c(1,2,1,2,1,2,1,2,1,2)
# [1] TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Here each value on the left-hand side of the operator is compared only with the value at the same position on the right-hand side.
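To make the recycling explicit, here is a minimal sketch that builds the recycled vector by hand with rep_len() and checks that it gives the same result as the implicit form. It uses a plain numeric vector named region (an illustrative stand-in for df$region, since the question notes the results are the same for numeric data).

```r
# Illustrative stand-in for df$region, as a plain numeric vector
region <- c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4)

# Recycle c(1, 2) by hand to the length of region: 1 2 1 2 1 2 1 2 1 2
recycled <- rep_len(c(1, 2), length(region))

# Element-by-element comparison against the hand-recycled vector
manual <- region == recycled

# The same comparison, letting R recycle c(1, 2) implicitly
implicit <- region == c(1, 2)

# Both forms give the same logical vector
identical(manual, implicit)
```

Note that R recycles silently here because 10 is a multiple of 2. With a right-hand side whose length does not divide the length of the left-hand side, R would still recycle but would also emit a warning.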
However, df$region %in% 1:2 works more like this:
sapply(df$region, function(x) { any(x==1:2) })
# [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
That is, each value is tested against the whole second vector, and TRUE is returned if there is at least one match.
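As a rough cross-check of that idea, the sketch below compares %in% with two other ways of asking the same membership question: match() (which is how base R defines %in%) and an explicit OR of two equality tests. The vector region is again an illustrative numeric stand-in for df$region, and the variable names are just for this example.

```r
# Illustrative stand-in for df$region, as a plain numeric vector
region <- c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4)

# Membership test: is each element equal to any of 1, 2?
via_in <- region %in% c(1, 2)

# Same test spelled out with match(): position of the first match, 0 if none
via_match <- match(region, c(1, 2), nomatch = 0L) > 0L

# Same test as an explicit OR of two scalar comparisons
via_or <- region == 1 | region == 2

# All three agree, and none depends on the position of an element in region
identical(via_in, via_match)
identical(via_in, via_or)
```

So for filtering on several allowed values, %in% (or an explicit OR) is the form that expresses "region is one of these", whereas == with a longer vector on the right expresses a position-by-position comparison.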
Upvotes: 4