pd441

Reputation: 2763

Trouble with understanding explanation of %in%

I am having trouble understanding %in%. In Hadley Wickham's book "R for Data Science", section 5.2.2 says, "A useful short-hand for this problem is x %in% y. This will select every row where x is one of the values in y." Then this example is given:

 nov_dec <- filter(flights, month %in% c(11, 12))

However, when I look at the syntax, it appears that it should be selecting every row where y is one of the values in x(?). So in the example, all the cases where 11 and 12 (y) appear in "month" (x).

?"%in%" doesn't make this any clearer to me. Obviously I'm missing something, but could someone please spell out exactly how this function works?

Upvotes: 3

Views: 156

Answers (4)

Caleb

Reputation: 124997

It appears that it should be selecting every row where y is one of the values in x(?) So in the example, all the cases where 11 and 12 appear in "month."

If you don't understand the behavior from looking at the example, try it out yourself. For example, you could do this:

> c(1,2,3) %in% c(2,4,6)
[1] FALSE  TRUE FALSE

So it looks like %in% gives you a vector of TRUE and FALSE values, one for each of the items in the first argument (the one before %in%). Let's try another:

> c(1,2,3) %in% c(2,4,6,8,10,12,1)
[1]  TRUE  TRUE FALSE

That confirms it: the first item in the returned vector is TRUE if the first item in the first argument is found anywhere in the second argument, and so on. Compare that result to the one you get using match():

> match(c(1,2,3), c(2,4,6,8,10,12,1))
[1]  7  1 NA

So the difference between match() and %in% is that the former gives you the actual position in the second argument of the first match for each item in the first argument, whereas %in% gives you a logical vector that just tells you whether each item in the first argument appears in the second.
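The relationship between the two can be sketched in a couple of lines. As far as I know, this mirrors how the ?match help page describes %in% (a match() call with nomatch = 0, compared against 0). The variable names pos, via_match and via_in are only for illustration.

```r
x <- c(1, 2, 3)
y <- c(2, 4, 6, 8, 10, 12, 1)

# match() gives the position of the first match in y; with nomatch = 0,
# items that are not found get 0 instead of NA
pos <- match(x, y, nomatch = 0)

# Any position greater than 0 means "found", which is what %in% reports
via_match <- pos > 0
via_in <- x %in% y

print(pos)
print(identical(via_match, via_in))
```

If you need to know where a value was found, use match(); if you only need a yes/no answer per item, %in% is enough.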

In the context of Wickham's book example, month is a vector of values representing the months in which various flights take place. So for the sake of argument, something like:

> month <- c(2,3,5,11,2,9,12,10,9,12,8,11,3)

Using the %in% operator lets you turn that vector into the answers to the question Is this flight in month 11 or 12? like this:

> month %in% c(11,12)
[1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE
[13] FALSE

which gives you a logical vector, i.e. a list of true/false values. The filter() function uses that logical vector to select corresponding rows from the flights table. Used together, filter and %in% answer the question What are all the flights that occur in months 11 or 12?
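Here is a minimal sketch of that last step, using base R subsetting instead of dplyr's filter() so that it runs without extra packages. The data frame flights_small and its flight_id column are made up for illustration; they are not the flights table from the book.

```r
# Hypothetical stand-in for the flights table (not the real data)
flights_small <- data.frame(
  flight_id = 1:6,
  month     = c(2, 11, 5, 12, 9, 11)
)

# One TRUE/FALSE per row: is this row's month 11 or 12?
keep <- flights_small$month %in% c(11, 12)

# Keep only the rows where the logical vector is TRUE
nov_dec <- flights_small[keep, ]
print(nov_dec)
```

With dplyr loaded, filter(flights_small, month %in% c(11, 12)) should select the same rows.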

If you turned the %in% around and instead asked:

> c(11,12) %in% month
[1] TRUE TRUE

you're really just asking Are there any flights in each of month 11 and month 12?

I can imagine that it might seem odd to ask whether a large vector is "in" a vector that has only two values. Consider reading x %in% y as Is each of the values from x also in y? You get one answer for each value of x.
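A quick way to keep the direction straight is to check the length of the result, which always matches the left-hand side. A small sketch, reusing the example month vector from above (per_flight and per_month are illustrative names):

```r
month <- c(2, 3, 5, 11, 2, 9, 12, 10, 9, 12, 8, 11, 3)

# 13 values on the left, so 13 answers: one per flight
per_flight <- month %in% c(11, 12)

# 2 values on the left, so 2 answers: one per month asked about
per_month <- c(11, 12) %in% month

print(length(per_flight))
print(length(per_month))
```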

Upvotes: 7

sconfluentus

Reputation: 4993

I think understanding how it works is somewhat semantic, and once you can say it logically then the grammar works itself out.

The key is to form a sentence in your head as you read the code. Picture the test being applied as you work your way through each row, with Boolean logic including or excluding that row based on whether its value is contained in the "filter by" list, %in% c( ).

 nov_dec <- filter(flights, month %in% c(11, 12))

In this case for your example above it should read like this:

"Set the variable nov_dec equal to the subset of rows in flights, where the variable column month (from those rows) is in the list c(11,12). "

As R works from the top down, it looks at month, and if it is either 11 or 12, the two values in your list, it includes that row in nov_dec; otherwise it just continues on.
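One way to check this reading is to compare %in% with the explicit comparison it abbreviates. The sketch below uses a made-up month vector. Note that R evaluates the whole vector at once rather than literally looping over rows, but the result is the same as testing each value in turn.

```r
# Made-up month values for illustration
month <- c(2, 3, 5, 11, 2, 9, 12, 10)

# Long form: test each value against 11, then against 12, and combine with OR
long_form <- month == 11 | month == 12

# Short form: test each value for membership in the set c(11, 12)
short_form <- month %in% c(11, 12)

print(short_form)
print(identical(long_form, short_form))
```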

Upvotes: 1

ajs

Reputation: 56

A quick exercise should be enough to demonstrate how the function works:

> x <- c(1, 2, 3, 4)
> y <- 4
> z <- 5

> x %in% y
[1] FALSE FALSE FALSE  TRUE

So the fourth element of numeric vector x is present in numeric vector y.

> y %in% x
[1] TRUE

And the first element of y (there's only one) is in x.

> z %in% x
[1] FALSE
> x %in% z
[1] FALSE FALSE FALSE FALSE

So z is not in x, and none of the elements of x is in z.

Also see the help for all matching functions with ?match

Upvotes: 2

Damien Cormann

Reputation: 189

This explicitly means: are the values from x also in y? The best way to understand it is with an example:

x <- 1:10       # numbers from 1 to 10
y <- (1:5) * 2  # even numbers between 2 and 10

y %in% x  # all TRUE: every even number between 2 and 10 is among the numbers 1 to 10

x %in% y  # TRUE only at the positions of x that hold even numbers

Upvotes: 0
