grapestory

Reputation: 183

Why do parentheses break my dplyr::filter() output?

I was working through R4DS and learning about the filter() function when I came across a strange result. I was trying to filter a tibble to find only the observations that had both a dep_delay and an arr_delay of less than 2 minutes. Here's my reprex:

library(tidyverse)
library(nycflights13)
filter(flights, dep_delay & arr_delay < 2)

which correctly outputs

# A tibble: 187,645 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013     1     1      544            545        -1     1004           1022       -18
 2  2013     1     1      554            600        -6      812            837       -25
 3  2013     1     1      557            600        -3      709            723       -14
 4  2013     1     1      557            600        -3      838            846        -8
 5  2013     1     1      558            600        -2      849            851        -2
 6  2013     1     1      558            600        -2      853            856        -3
 7  2013     1     1      558            600        -2      923            937       -14
 8  2013     1     1      559            600        -1      854            902        -8
 9  2013     1     1      601            600         1      844            850        -6
10  2013     1     1      602            610        -8      812            820        -8
# ... with 187,635 more rows, and 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

However, if I add parentheses, the output changes for some reason:

filter(flights, (dep_delay & arr_delay) < 2)
# A tibble: 327,394 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013     1     1      517            515         2      830            819        11
 2  2013     1     1      533            529         4      850            830        20
 3  2013     1     1      542            540         2      923            850        33
 4  2013     1     1      544            545        -1     1004           1022       -18
 5  2013     1     1      554            600        -6      812            837       -25
 6  2013     1     1      554            558        -4      740            728        12
 7  2013     1     1      555            600        -5      913            854        19
 8  2013     1     1      557            600        -3      709            723       -14
 9  2013     1     1      557            600        -3      838            846        -8
10  2013     1     1      558            600        -2      753            745         8
# ... with 327,384 more rows, and 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

Notice that row 2 has incorrect values for both variables. At first I thought that by adding the parentheses I was perhaps converting (dep_delay & arr_delay) to TRUE or 1, but that would create an entirely different output. Can anyone help me understand what's going on?

Upvotes: 1

Views: 213

Answers (1)

r2evans

Reputation: 160647

You aren't getting what you think you're getting.

dep_delay & arr_delay < 2 is two separate logical conditions, because < is evaluated before &:

  1. dep_delay, which is effectively (dep_delay != 0).
  2. arr_delay < 2, which is self-evident.
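
A minimal sketch of those two conditions on toy vectors (the values and the names dep and arr are made up for illustration, not taken from flights). It suggests that the unparenthesized expression behaves like a non-zero test on the left combined with the comparison on the right:

```r
# Hypothetical stand-ins for the dep_delay and arr_delay columns.
dep <- c(-3, 0, 5, 1)
arr <- c(-10, -4, 1, 30)

# `<` is evaluated before `&`, so only `arr` is compared to 2;
# `dep` is coerced to logical on its own (non-zero means TRUE).
implicit <- dep & arr < 2
explicit <- (dep != 0) & (arr < 2)

print(implicit)
print(identical(implicit, explicit))
```

Note that the third element passes even though its dep value is 5, which is the same effect that lets flights with large departure delays through in your first call.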

In truth, there are only 167,639 rows in flights where dep_delay and arr_delay are both non-NA and both less than 2.

with(flights, table(arr_delay < 2, dep_delay < 2, useNA = "always")) %>%
  addmargins()
#        
#          FALSE   TRUE   <NA>    Sum
#   FALSE  87941  39988      0 127929
#   TRUE   31778 167639      0 199417
#   <NA>     663    512   8255   9430
#   Sum   120382 208139   8255 336776

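I understand what you're trying to do, but it does not translate directly into R syntax: the comparison < 2 is not distributed over both sides of &, so each variable needs its own comparison.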

Just do one of:

dplyr::filter(flights, dep_delay < 2 & arr_delay < 2)
dplyr::filter(flights, dep_delay < 2, arr_delay < 2)

dplyr::filter combines multiple conditions with "AND" logic by default, so you can always use the second format above. Really, the only time you need to start using logical operators is when you want an "OR" anywhere in the logic.
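
As a rough illustration, here is a sketch on a tiny made-up tibble (toy and its values are hypothetical, not part of nycflights13) comparing the comma form, the & form, and an explicit "OR":

```r
library(dplyr)

# Hypothetical miniature stand-in for `flights`.
toy <- tibble(
  dep_delay = c(-3, 0, 5, 1, NA),
  arr_delay = c(-10, -4, 1, 30, -2)
)

# Comma-separated conditions and `&` should keep the same rows.
with_comma <- filter(toy, dep_delay < 2, arr_delay < 2)
with_and   <- filter(toy, dep_delay < 2 & arr_delay < 2)
print(isTRUE(all.equal(with_comma, with_and)))

# An "OR" has to be spelled out with `|`.
either <- filter(toy, dep_delay < 2 | arr_delay < 2)
print(nrow(either))
```

The row with a missing dep_delay is dropped from the "AND" results because filter() only keeps rows where the condition is TRUE, not NA.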


BTW: to see more about point 1 above, see

if (-1) 1 else 2
# [1] 1
if (0) 1 else 2
# [1] 2
if (1) 1 else 2
# [1] 1
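
The same coercion is the likely explanation for the parenthesized version. A small sketch on made-up vectors (dep, arr and both are hypothetical names): once the parentheses force & to run first, the result is logical, and TRUE/FALSE are treated as 1/0 when compared to 2.

```r
# Hypothetical stand-ins for the dep_delay and arr_delay columns.
dep <- c(-3, 0, 5, NA)
arr <- c(-10, -4, 1, 30)

# The parentheses make `&` run first, giving a logical vector.
both <- dep & arr
print(both)

# Comparing that logical vector to 2 coerces TRUE/FALSE to 1/0,
# and both are below 2, so every non-NA element becomes TRUE.
print(both < 2)
```

So (dep_delay & arr_delay) < 2 keeps every row where the & result is not NA, regardless of the delay values, which is why the row count jumps.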

Upvotes: 1
