frankieR

Reputation: 23

R dplyr filter function removes every other row

I am applying the R dplyr::filter function to the nycflights13 flights dataset. I am trying to select only the flights in November and December, using two alternative scripts.

library(tidyverse)
library(nycflights13)

# Alternative 1
df1 <- filter(flights, month == 11 | month == 12)

# Alternative 2
df2 <- filter(flights, month == c(11, 12))

df1 yields the expected result (55403 observations in total). df2 also returns only November and December flights, but it contains just half of the observations (27702 in total): every other row appears to have been removed.

My question is: does anyone know why? I know the df2 syntax is incorrect, but I am trying to understand why it yields the outcome that it does.

Thanks

Upvotes: 1

Views: 333

Answers (3)

mrhellmann

Reputation: 5529

My question is: does anyone know why?

R recycles vectors when they are not long enough for whatever you need them for. In the df2 syntax, the filter tests month == 11 in row 1, month == 12 in row 2, month == 11 in row 3, and so on.

You should end up with the November flights that sat in odd-numbered rows of the original data and the December flights that sat in even-numbered rows of the original.
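To make the recycling visible, here is a minimal base R sketch on a made-up toy vector (the values in `month` and the names `pattern`, `recycled`, `explicit` and `kept` are only illustrative, not taken from the flights data). It builds the recycled comparison vector by hand and checks that it gives the same result as `month == c(11, 12)`.

```r
# Toy stand-in for flights$month (hypothetical values, not the real data)
month <- c(11, 11, 12, 12, 11, 12, 1, 11)

# What R effectively compares against: c(11, 12) repeated to the length of month
pattern <- rep(c(11, 12), length.out = length(month))

# The comparison used in the df2 syntax
recycled <- month == c(11, 12)

# The same comparison with the recycling written out explicitly
explicit <- month == pattern

# Positions that survive the filter: 11s in odd rows, 12s in even rows
kept <- which(recycled)
print(kept)
```

Rows 2 and 3 hold valid months (11 and 12), but they are dropped because they sit at the "wrong" position relative to the recycled pattern.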

Upvotes: 1

frankieR

Reputation: 23

I agree with @Duck about the difference between using | and ==.

But this does not yet explain why R cuts off every other row.

I also ran the two separate commands below, filtering only the flights in Dec or Nov, respectively. The average of the two observation counts (28135 and 27268) is 27701.5, which is essentially the same as the number of observations in df2 (27702).

df3 <- filter(flights, month == 11)

df4 <- filter(flights, month == 12)
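A small base R sketch of why the count could land at the average: if the vector c(11, 12) is recycled down the rows, roughly half of the November rows face an 11 and roughly half of the December rows face a 12. The toy vector below is made up for illustration (8 "November" and 12 "December" rows after 6 other rows), and the variable names are hypothetical.

```r
# Toy month column: 6 rows of October, 8 of November, 12 of December
month <- rep(c(10, 11, 12), times = c(6, 8, 12))

# Counts from the two separate filters
n_nov <- sum(month == 11)
n_dec <- sum(month == 12)

# Count from the comparison against the recycled vector c(11, 12)
n_recycled <- sum(month == c(11, 12))

# Average of the two separate counts
average <- (n_nov + n_dec) / 2

print(c(n_nov = n_nov, n_dec = n_dec, n_recycled = n_recycled, average = average))
```

In this toy case the recycled count equals the average exactly; on real data it should only be close, since it depends on how the months fall on odd and even rows.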

Upvotes: 0

Duck

Reputation: 39613

Maybe:

df2 <- filter(flights, month %in% c(11, 12))

When using ==, your code may be considering only the first element of the vector. Also, the | operator appears to evaluate each condition separately, whereas == evaluates the condition simultaneously.
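As a quick check that %in% matches the | version, here is a base R sketch on a made-up vector (the values and the names `with_or` and `with_in` are only for illustration). It assumes nothing about the flights data beyond month being a numeric column.

```r
# Toy stand-in for flights$month (hypothetical values)
month <- c(1, 11, 12, 11, 3, 12, 12, 11)

# Condition from the df1 syntax: each comparison evaluated, then combined
with_or <- month == 11 | month == 12

# Condition from the suggested syntax: membership in the set {11, 12}
with_in <- month %in% c(11, 12)

# Both logical vectors should select exactly the same rows
print(identical(with_or, with_in))
print(sum(with_in))
```

Since %in% tests membership for every element, the position of a row does not matter, so no rows are skipped.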

Upvotes: 0
