Reputation: 23
I am applying the R dplyr::filter function to the flights dataset from nycflights13.
I am trying to select only flights in November and December, using two alternative scripts.
library (tidyverse)
library (nycflights13)
1
df1 <- filter (flights, month == 11 | month == 12)
2
df2 <- filter (flights, month == c(11,12))
df1 yields the expected result (55403 observations in total). df2 also runs, but the resulting dataset contains only about half of the observations (27702 in total). It looks as if df2 has every other row removed.
My question is: does anyone know why? I know the df2 syntax is incorrect, but I am trying to understand why it yields the outcome that it does.
Thanks
Upvotes: 1
Views: 333
Reputation: 5529
My question is: does anyone know why?
R recycles vectors when they are shorter than the operation needs. In the df2 syntax, filter will test month == 11 in row 1, month == 12 in row 2, month == 11 in row 3, and so on.
You should end up with the November flights that sat in odd-numbered rows of the original data and the December flights that sat in even-numbered rows.
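A minimal sketch of that recycling in base R, using a made-up six-element vector in place of flights$month (the names months_demo, recycled and kept_rows are only for illustration, not from the question):

```r
# Hypothetical toy data standing in for flights$month
months_demo <- c(11, 11, 12, 12, 11, 12)

# c(11, 12) is recycled to c(11, 12, 11, 12, 11, 12) before the comparison
recycled <- months_demo == c(11, 12)

# Positions that survive: November in odd positions, December in even positions
kept_rows <- which(recycled)
print(kept_rows)
```

Position 2 (a November value in an even position) and position 3 (a December value in an odd position) are dropped, which is the same pattern df2 shows on the full dataset.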
Upvotes: 1
Reputation: 23
I agree with @Duck about the difference between using | and ==, but this does not yet explain why R cuts off every other row.
I also ran the two separate commands below, filtering only the flights in December or November, respectively. The average of the two observation counts (28135 and 27268) is 27701.5, which is essentially the number of observations in df2 (27702).
df3 <- filter (flights, month == 11)
df4 <- filter (flights, month == 12)
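The averaging pattern can be reproduced on a small deterministic example. This is only a sketch with invented data (months_demo is a hypothetical 120-element vector, ten rows per month), not the flights table itself:

```r
# Hypothetical toy data: months 1..12, ten rows each (120 rows in total)
months_demo <- rep(1:12, each = 10)

n_nov <- sum(months_demo == 11)   # rows with month 11
n_dec <- sum(months_demo == 12)   # rows with month 12

# Recycled comparison: 11 is tested on odd rows, 12 on even rows,
# so roughly half of each month's rows pass
n_recycled <- sum(months_demo == c(11, 12))

print(c(n_nov = n_nov, n_dec = n_dec, n_recycled = n_recycled))
print(mean(c(n_nov, n_dec)))
```

Here exactly half of the November rows and half of the December rows pass, so the recycled count equals the average of the two separate counts. On real data the odd/even split is only approximately even, which would account for a small difference.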
Upvotes: 0
Reputation: 39613
Maybe:
df2 <- filter (flights, month %in% c(11,12))
Your code using == does not test membership: it compares month against the vector c(11, 12) element by element, reusing the vector as needed. The | operator evaluates each condition separately and combines the results, whereas %in% checks whether each value of month appears anywhere in the vector.
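As a quick check that %in% behaves like the | version, here is a small base R sketch on made-up values (months_demo, with_or and with_in are illustrative names only):

```r
# Hypothetical toy data standing in for flights$month
months_demo <- c(10, 11, 12, 11, 12, 1)

# Two explicit conditions combined with |
with_or <- months_demo == 11 | months_demo == 12

# Membership test against the set c(11, 12)
with_in <- months_demo %in% c(11, 12)

print(identical(with_or, with_in))
```

Both logical vectors select the same four positions, so either form should be usable inside filter.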
Upvotes: 0