Reputation: 183
I was working through R4DS and learning about the filter() function when I came across a strange result. I was trying to filter a tibble to find only the observations that had a dep_delay and an arr_delay of less than 2 minutes. Here's my reprex:
library(tidyverse)
library(nycflights13)
filter(flights, dep_delay & arr_delay < 2)
which correctly outputs
# A tibble: 187,645 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 2013 1 1 544 545 -1 1004 1022 -18
2 2013 1 1 554 600 -6 812 837 -25
3 2013 1 1 557 600 -3 709 723 -14
4 2013 1 1 557 600 -3 838 846 -8
5 2013 1 1 558 600 -2 849 851 -2
6 2013 1 1 558 600 -2 853 856 -3
7 2013 1 1 558 600 -2 923 937 -14
8 2013 1 1 559 600 -1 854 902 -8
9 2013 1 1 601 600 1 844 850 -6
10 2013 1 1 602 610 -8 812 820 -8
# ... with 187,635 more rows, and 10 more variables: carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
However, if I add parentheses, the output changes for some reason:
filter(flights, (dep_delay & arr_delay) < 2)
# A tibble: 327,394 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 2013 1 1 517 515 2 830 819 11
2 2013 1 1 533 529 4 850 830 20
3 2013 1 1 542 540 2 923 850 33
4 2013 1 1 544 545 -1 1004 1022 -18
5 2013 1 1 554 600 -6 812 837 -25
6 2013 1 1 554 558 -4 740 728 12
7 2013 1 1 555 600 -5 913 854 19
8 2013 1 1 557 600 -3 709 723 -14
9 2013 1 1 557 600 -3 838 846 -8
10 2013 1 1 558 600 -2 753 745 8
# ... with 327,384 more rows, and 10 more variables: carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
Notice that row 2 has incorrect values for both variables. At first I thought that by adding the parentheses I was perhaps converting (dep_delay & arr_delay) to TRUE or 1, but that would actually create an entirely different output. Can anyone help me understand what's going on?
Upvotes: 1
Views: 213
Reputation: 160647
You aren't getting what you think you're getting.
dep_delay & arr_delay < 2 is two separate logical conditions:
1. dep_delay, which is effectively (dep_delay != 0).
2. arr_delay < 2, which is self-evident.
In truth, there are only 167,639 rows in flights where dep_delay and arr_delay are both non-NA and less than 2.
with(flights, table(arr_delay < 2, dep_delay < 2, useNA = "always")) %>%
addmargins()
#
#          FALSE   TRUE   <NA>    Sum
#   FALSE  87941  39988      0 127929
#   TRUE   31778 167639      0 199417
#   <NA>     663    512   8255   9430
#   Sum   120382 208139   8255 336776
While I understand what you're trying to do, it does not translate the same into R syntax.
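To make the parsing concrete, here is a minimal sketch in base R. The vectors `dep` and `arr` are made up for illustration and are not taken from flights. It shows that the comparison is evaluated before `&`, so only the right-hand side is compared with 2 while the left-hand side is coerced to logical.

```r
# Toy vectors (hypothetical values, not rows from flights)
dep <- c(-3, 0, 5, 1)
arr <- c(-8, -2, 1, 10)

# `<` is evaluated before `&`, so these two expressions are equivalent:
# the bare numeric `dep` is coerced to logical (non-zero becomes TRUE)
unparenthesised <- dep & arr < 2
explicit        <- (dep != 0) & (arr < 2)

# What was actually intended: each variable compared with 2 separately
intended <- dep < 2 & arr < 2

print(identical(unparenthesised, explicit))
print(unparenthesised)
print(intended)
```

In this toy data the third element (dep = 5, arr = 1) passes the unparenthesised test even though dep is not below 2, and the second element (dep = 0) fails it even though both values are below 2.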
Just do one of:
dplyr::filter(flights, dep_delay < 2 & arr_delay < 2)
dplyr::filter(flights, dep_delay < 2, arr_delay < 2)
dplyr::filter defaults to "AND" logic, so you can always use the second format above. Really, the only time you need to start using Logic operators is when you want an "OR" anywhere in the logic.
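As a rough illustration of the "AND" versus "OR" point, the sketch below uses a small made-up data frame (`toy` is hypothetical, standing in for flights) and assumes dplyr is installed. It compares the two "AND" spellings with an explicit `|`.

```r
library(dplyr)

# Hypothetical toy data standing in for flights
toy <- data.frame(
  dep_delay = c(-3, 0, 5, 1, NA),
  arr_delay = c(-8, -2, 1, 10, 0)
)

# Two spellings of "AND": an explicit `&`, or comma-separated conditions
both_amp   <- filter(toy, dep_delay < 2 & arr_delay < 2)
both_comma <- filter(toy, dep_delay < 2, arr_delay < 2)

# "OR" has no comma shorthand, so it needs the `|` operator
either <- filter(toy, dep_delay < 2 | arr_delay < 2)

print(nrow(both_amp))
print(nrow(both_comma))
print(nrow(either))
```

Note that filter() drops rows where the condition evaluates to NA. In the "AND" forms the row with a missing dep_delay is removed, while in the "OR" form it is kept because NA | TRUE is TRUE.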
BTW: to see more about point 1 above, see
if (-1) 1 else 2
# [1] 1
if (0) 1 else 2
# [1] 2
if (1) 1 else 2
# [1] 1
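The same coercion explains the parenthesised version from the question. A minimal sketch, again with made-up vectors: `(dep & arr)` is already logical, and comparing a logical with 2 treats TRUE as 1 and FALSE as 0, both of which are below 2.

```r
# Hypothetical values; the first two pairs loosely echo rows from the question
dep <- c(2, 4, -1, 0, NA)
arr <- c(11, 20, -18, 5, 3)

# Step 1: `&` coerces both sides to logical, giving TRUE, FALSE, or NA
combined <- dep & arr

# Step 2: comparing that logical with 2 uses TRUE as 1 and FALSE as 0,
# so every non-NA element passes regardless of the original delays
result <- combined < 2

print(combined)
print(result)
```

So filter(flights, (dep_delay & arr_delay) < 2) keeps every row for which dep_delay & arr_delay is not NA, which is why rows with large delays show up in that output.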
Upvotes: 1