Reputation: 2763
I am having trouble understanding %in%. In Hadley Wickham's book "R for Data Science", section 5.2.2 says: "A useful short-hand for this problem is x %in% y. This will select every row where x is one of the values in y." Then this example is given:
nov_dec <- filter(flights, month %in% c(11, 12))
However, when I look at the syntax, it appears that it should be selecting every row where y is one of the values in x(?). So in the example, all the cases where 11 and 12 (y) appear in "month" (x).
?"%in%" doesn't make this any clearer to me. Obviously I'm missing something, but could someone please spell out exactly how this function works?
Upvotes: 3
Views: 156
Reputation: 124997
It appears that it should be selecting every row where y is one of the values in x(?) So in the example, all the cases where 11 and 12 appear in "month."
If you don't understand the behavior from looking at the example, try it out yourself. For example, you could do this:
> c(1,2,3) %in% c(2,4,6)
[1] FALSE TRUE FALSE
So it looks like %in% gives you a vector of TRUE and FALSE values that correspond to each of the items in the first argument (the one before %in%). Let's try another:
> c(1,2,3) %in% c(2,4,6,8,10,12,1)
[1] TRUE TRUE FALSE
That confirms it: the first item in the returned vector is TRUE if the first item in the first argument is found anywhere in the second argument, and so on. Compare that result to the one you get using match():
> match(c(1,2,3), c(2,4,6,8,10,12,1))
[1] 7 1 NA
So the difference between match() and %in% is that the former gives you the actual position in the second argument of the first match for each item in the first argument, whereas %in% gives you a logical vector that just tells you whether each item in the first argument appears in the second.
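The relationship between the two can be made explicit. As far as I know, base R defines %in% in terms of match(), roughly as sketched below. The name my_in is made up for this illustration and is not part of R.

```r
# Hypothetical re-implementation of %in% on top of match(), for illustration only.
# nomatch = 0L makes unmatched items come back as 0 instead of NA.
my_in <- function(x, table) match(x, table, nomatch = 0L) > 0L

x <- c(1, 2, 3)
y <- c(2, 4, 6, 8, 10, 12, 1)

match(x, y)    # positions in y: 7 1 NA
my_in(x, y)    # TRUE TRUE FALSE
x %in% y       # same logical vector as my_in(x, y)
```

So %in% is essentially match() with the positions thrown away and only the "found or not" part kept.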
In the context of Wickham's book example, month is a vector of values representing the months in which various flights take place. So for the sake of argument, something like:
> month <- c(2,3,5,11,2,9,12,10,9,12,8,11,3)
Using the %in% operator lets you turn that vector into the answers to the question "Is this flight in month 11 or 12?" like this:
> month %in% c(11,12)
[1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
[13] FALSE
which gives you a logical vector, i.e. one TRUE/FALSE value per flight. The filter() function uses that logical vector to select the corresponding rows from the flights table. Used together, filter() and %in% answer the question "What are all the flights that occur in months 11 or 12?"
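To see the "logical vector selects rows" step without dplyr, here is a minimal base-R sketch. The toy_flights data frame and its values are invented for the example, and plain logical subsetting is only roughly what filter() does here, not its actual implementation.

```r
# Toy stand-in for the flights table; column names and values are invented.
toy_flights <- data.frame(
  flight = c("A1", "B2", "C3", "D4", "E5"),
  month  = c(2, 11, 5, 12, 11),
  stringsAsFactors = FALSE
)

keep <- toy_flights$month %in% c(11, 12)   # one TRUE/FALSE per row
keep                                       # FALSE TRUE FALSE TRUE TRUE

# Keep only the rows where the logical vector is TRUE
nov_dec <- toy_flights[keep, ]
nov_dec$flight                             # "B2" "D4" "E5"
```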
If you turned the %in% around and instead asked:
> c(11,12) %in% month
[1] TRUE TRUE
you're really just asking "Are there any flights in each of month 11 and month 12?"
I can imagine that it might seem odd to ask whether a large vector is "in" a vector that has only two values. Consider reading x %in% y as "Is each of the values from x also in y?"
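One way to keep the direction straight is that the result always has one answer per element of the left-hand side. A small sketch, reusing the made-up month vector from above:

```r
month <- c(2, 3, 5, 11, 2, 9, 12, 10, 9, 12, 8, 11, 3)

length(month %in% c(11, 12))   # 13, one answer per element of month
length(c(11, 12) %in% month)   # 2, one answer per element of c(11, 12)

sum(month %in% c(11, 12))      # 4 flights fall in month 11 or 12
```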
Upvotes: 7
Reputation: 4993
I think understanding how it works is somewhat semantic, and once you can say it logically then the grammar works itself out.
The key is to create a sentence in your head, as you read the code, that includes the context of apply as you work your way through each row, and Boolean logic to include or exclude rows based on what is contained in the "filter by" list, %in% c( ).
nov_dec <- filter(flights, month %in% c(11, 12))
In this case, for your example above, it should read like this:
"Set the variable nov_dec equal to the subset of rows in flights where the variable column month (from those rows) is in the list c(11,12)."
As R works from the top down, it looks at month and, if it is either 11 or 12, the two values in your list, it includes that row in nov_dec; otherwise it just continues on.
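That row-by-row reading can be written out as an explicit loop. This is only a sketch of the mental model: as far as I know, R evaluates %in% on the whole vector at once rather than literally looping like this, and the month values and the names wanted and keep_loop are made up for illustration.

```r
month  <- c(2, 11, 5, 12)   # made-up values for illustration
wanted <- c(11, 12)

# Row-by-row reading, written as an explicit loop
keep_loop <- logical(length(month))
for (i in seq_along(month)) {
  keep_loop[i] <- any(month[i] == wanted)
}

keep_loop                                  # FALSE TRUE FALSE TRUE
identical(keep_loop, month %in% wanted)    # TRUE
```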
Upvotes: 1
Reputation: 56
A quick exercise should be enough to demonstrate how the function works:
> x <- c(1, 2, 3, 4)
> y <- 4
> z <- 5
> x %in% y
[1] FALSE FALSE FALSE TRUE
So the fourth element of numeric vector x is present in numeric vector y.
> y %in% x
[1] TRUE
And the first element of y (there's only one) is in x.
> z %in% x
[1] FALSE
> x %in% z
[1] FALSE FALSE FALSE FALSE
And z is not in x, nor is any element of x in z.
Also see the help for all matching functions with ?match
Upvotes: 2
Reputation: 189
This explicitly means: are the values from x also in y? The best way to understand is an example:
x <- 1:10      # numbers from 1 to 10
y <- (1:5)*2   # even numbers between 2 and 10
y %in% x       # all TRUE: every even number between 2 and 10 is in 1 to 10
x %in% y       # TRUE only at the even numbers
Upvotes: 0