Reputation: 189
I'm struggling with how to read the %in% operator in R in "plain English" terms. I've seen multiple examples of code for its use, but not a clear explanation of how to read it.
For example, I've found terminology for the pipe operator %>% that suggests reading it as "and then." I'm looking for a similar translation for the %in% operator.
In the book R for Data Science in chapter 5 titled "Data Transformation" there is an example from the flights data set that reads as follows:
The following code finds all flights that departed in November or December:
filter(flights, month == 11 | month == 12)
A useful short-hand for this problem is x %in% y. This will select every row where x is one of the values in y. We could use it to rewrite the code above:
nov_dec <- filter(flights, month %in% c(11, 12))
When I read "a useful short-hand for this problem is x %in% y," and then look at the nov_dec example, it seems like this is to be understood as "select every row where month (x) is one of the values in c(11,12) (y)," which doesn't make sense to me.
However, my brain wants to read it as something like, "Look for 11 and 12 in the month column." In this example, it seems like x should be the values 11 and 12, and the %in% operator is checking whether those values are in y, which would be the month column. My brain is reading this example from right to left.
However, all of the code examples I've found seem to indicate that this x %in% y should be read left to right and not right to left.
Can anyone help me read the %in% operator in layman's terms please? Examples would be appreciated.
Upvotes: 3
Views: 183
Reputation: 10961
I think your disconnect is understanding how to apply "in" to a vector. You wrote that you want to read it as "Look for 11 and 12 in the month column." You can indeed think of it that way. Your example was:
nov_dec <- filter(flights, month %in% c(11, 12))
And that could be expressed in plain English as:
Give me all the flights where one of the values in c(11, 12) is in the month column.
But we could also say that 11 and 12 are "in" the vector c(11, 12). That's what the left-to-right reading would be:
Give me all the flights whose month is in the vector c(11, 12).
Or, expressed slightly differently and more verbosely:
Give me all the flights whose month is equal to one of the values in the vector c(11, 12).
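To make that reading concrete, here is a minimal sketch in base R. The flights_small data frame and its values are made up for illustration; it is not the real flights data, and the subsetting with [ is only a rough base R analogue of the filter() call.

```r
# Hypothetical stand-in for the flights data: just an id and a month column
flights_small <- data.frame(
  id    = 1:6,
  month = c(1, 11, 12, 6, 11, 3)
)

# Left to right: for each flight's month, is it one of 11 or 12?
is_nov_dec <- flights_small$month %in% c(11, 12)

# Keep the rows where the answer is TRUE
nov_dec_small <- flights_small[is_nov_dec, ]
nov_dec_small$id
```

The logical vector has one entry per flight, which is exactly what row selection needs.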
This is conceptually similar to using a bunch of | operators in a row (month == 11 | month == 12), but it's best not to think of those as exactly equivalent. Instead of explicitly comparing x to every value in y, you're asking the question "is x equal to one of the values in y?" That's different in the same way that saying "please turn off the lights" is different from saying "please walk over to that plate on the wall and pull the little stick on it downwards." It's expressing what you want instead of how to figure it out, which makes your code more readable. Code is read more often than it's written, so that matters.
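One concrete place where the two forms are not exactly equivalent is missing values. The sketch below uses a made-up vector x; the variable names are only for illustration.

```r
x <- c(11, NA, 3)   # hypothetical values, including a missing one

# Chained comparisons: comparing NA with == yields NA
chained <- x == 11 | x == 12

# %in% always answers TRUE or FALSE, never NA
with_in <- x %in% c(11, 12)

chained
with_in
```

Inside dplyr's filter() this difference does not change which rows come back, because filter() drops rows where the condition is NA, but it can matter when the logical vector is used elsewhere.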
Now I'm getting way out of my area - again, I don't know what R actually does here - but the underlying method of answering the question might also be different. It might use a binary search algorithm to find out if x is in y.
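For what it's worth, R's help page for match (?match) describes %in% in terms of match(): roughly, x %in% table is match(x, table, nomatch = 0) > 0. The sketch below only checks that the two agree on a made-up input; it says nothing about whether the lookup inside match() is a binary search or something else.

```r
x <- c(11, 5, 12)   # hypothetical values to look up
y <- c(11, 12)      # hypothetical table to look in

via_in    <- x %in% y
# match() returns the position of each x in y, or 0 when there is no match
via_match <- match(x, y, nomatch = 0L) > 0L

identical(via_in, via_match)
```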
Upvotes: 0
Reputation: 145765
If I wanted to really "spell it out", I'd read x %in% y as "for each x value, is it in y?"
nov_dec <- filter(flights, month %in% c(11, 12))
When I read "A useful short-hand for this problem is x %in% y," and then look at the nov_dec example, it seems like this is to be understood as "select every row where month ('x') is one of the values in c(11,12) ('y')," which doesn't make sense to me.
However my brain wants to read it as something like, "Look for 11 and 12 in the month column." In this example, it seems like 'x' should be the values of 11 and 12 and the %in% operator is checking if those values are in 'y' which would be the month column. My brain is reading this example from right to left.
The left-vs-right thing is all about what you're asking about. x %in% y is asking (using my verbose phrasing above), "for each x value, is it in y?" With that phrasing, we know to expect an answer (TRUE or FALSE) for every item in x.
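A small sketch of that point, using a made-up month vector: the result always has one answer per element of the left-hand side, so swapping the operands asks a different question.

```r
month <- c(1, 11, 12, 6, 11)   # hypothetical month values

# "For each month, is it in c(11, 12)?" gives one answer per month
per_month <- month %in% c(11, 12)
length(per_month)

# Swapped: "for each of 11 and 12, does it occur anywhere in month?"
# gives only two answers, one per value on the left
per_target <- c(11, 12) %in% month
length(per_target)
```

Only the first form has one TRUE/FALSE per row, which is why it is the one that works for selecting rows.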
This might actually get clearer if we extend it a little more - two common related questions are "are any x values in y?" and "are all the x values in y?" These can be coded naturally as:
any(x %in% y) # Are any x values in y?
all(x %in% y) # Are all x values in y?
To me, at least, those seem quite natural, and they use the left-to-right reading. It would get convoluted to try to use a right-to-left reading here, something like "look for the y values in x, did you cover every x value with your matches?"
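With concrete (made-up) values for x and y, the any()/all() questions come out like this:

```r
x <- c(11, 12, 13)   # hypothetical values being asked about
y <- c(11, 12)       # hypothetical set to look in

any_in  <- any(x %in% y)   # is at least one x value in y?
all_in  <- all(x %in% y)   # is every x value in y? (13 is not)
all_rev <- all(y %in% x)   # swapped question: is every y value in x?

c(any_in, all_in, all_rev)
```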
Upvotes: 4
Reputation: 18714
That's actually a really good question. Think about the literal nature here:
When you use %in%, it is in lieu of an 'or' statement: are any of these in here?
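# no seed is set, so the shuffled order (and the output below) varies between runs
answers <- data.frame(
  ans = sample(rep(c("yes", "no", "maybe"), each = 3, times = 2)),
  ind = 1:9  # recycled to fill all 18 rows
)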
# yes or no?
answers[answers$ans == "yes" | answers$ans == "no", ]
# ans ind
# 1 yes 1
# 2 yes 2
# 4 no 4
# 5 yes 5
# 6 no 6
# 8 no 8
# 10 yes 1
# 12 no 3
# 13 no 4
# 16 yes 7
# 17 no 8
# 18 yes 9
# now how about %in%?
answers[answers$ans %in% c("yes", "no"), ]
# ans ind
# 1 yes 1
# 2 yes 2
# 4 no 4
# 5 yes 5
# 6 no 6
# 8 no 8
# 10 yes 1
# 12 no 3
# 13 no 4
# 16 yes 7
# 17 no 8
# 18 yes 9
# yes and no?
answers[answers$ans == c("yes","no"),]
# ans ind
# 1 yes 1
# 4 no 4
# 5 yes 5
# 6 no 6
# 8 no 8
# 12 no 3
# what happened here? were you expecting that?
# == recycled c("yes", "no") down the column, so this checked
# the first row for yes,
# the second row for no,
# the third row for yes,
# the fourth row for no and so on...
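Because the data above is shuffled with sample(), the exact rows differ from run to run. A minimal sketch with a fixed, made-up vector makes the recycling easier to follow:

```r
ans <- c("yes", "yes", "no", "no", "maybe", "yes")   # hypothetical answers

# == compares position by position against yes, no, yes, no, yes, no
recycled <- ans == c("yes", "no")

# %in% asks, for each answer, whether it is "yes" or "no"
membership <- ans %in% c("yes", "no")

recycled
membership
```

Only positions 1 and 4 happen to line up with the recycled pattern, while %in% picks every "yes" and "no" regardless of position.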
Upvotes: 3