NelnewR

Reputation: 131

Two expressions in R that should yield the same answer, but they don't

I am new to R and am working through R for Data Science to teach myself the basics. While doing the exercises in section 5.2.4, I wrote two versions of a filtering expression that I assumed would be equivalent. I loaded the following packages:

library(nycflights13)
library(tidyverse)

I then wanted to select the flights from the included flights data set that departed between midnight and 6 am. I used the following code:

d1 <- filter(flights, dep_time >= 0 & dep_time <= 600)  # yields 9344 rows
d2 <- flights[between(flights$dep_time, 0, 600), ]      # yields 17599 rows
d3 <- filter(flights, between(dep_time, 0, 600))        # again yields 9344 rows

I cannot figure out why d2 is different from the others. Can anyone explain? Thank you for taking the time to answer such a basic question.

Upvotes: 3

Views: 62

Answers (1)

akrun

Reputation: 886938

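dep_time contains missing values, so between() returns NA for those elements in addition to TRUE/FALSE. When a logical index passed to `[` contains NA, base R returns a row filled with NA for each such element instead of dropping it. That is why d2 has more rows: 17599 - 9344 = 8255 of them are all-NA rows. Counting only the TRUE elements gives the expected number: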

sum(between(flights$dep_time, 0 , 600), na.rm = TRUE)
#[1] 9344
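
The effect is easy to reproduce on a small made-up data frame. This is only a sketch in base R: the `toy` object and its values are hypothetical and not taken from flights.

```r
# Hypothetical toy data; the second dep_time is missing
toy <- data.frame(id = 1:4, dep_time = c(30, NA, 700, 559))

# Logical index with an NA in position 2: TRUE NA FALSE TRUE
idx <- toy$dep_time >= 0 & toy$dep_time <= 600

res <- toy[idx, ]          # the NA index produces a row filled with NA
nrow(res)                  # 3 rows, although only 2 satisfy the condition
sum(idx, na.rm = TRUE)     # 2, the number of TRUE elements
```

The extra row in `res` corresponds to the NA in `idx`, which mirrors what happens with d2 on the full data set.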

filter() keeps only the rows where the condition is TRUE, so rows where the condition evaluates to NA are dropped.

One option is to turn the NA elements into `FALSE`:

i1 <- between(flights$dep_time, 0, 600)  & !is.na(flights$dep_time)
d2 <- flights[i1,]
dim(d2)
#[1] 9344   19
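
As a sketch of two other base R routes to the same result, which() and subset() both discard elements where the condition is NA. The `toy` data frame below is hypothetical; on the real data the condition would be between(flights$dep_time, 0, 600).

```r
# Hypothetical toy data; the second dep_time is missing
toy <- data.frame(id = 1:4, dep_time = c(30, NA, 700, 559))
cond <- toy$dep_time >= 0 & toy$dep_time <= 600   # TRUE NA FALSE TRUE

# which() returns only the positions that are TRUE, so NA is dropped
by_which <- toy[which(cond), ]

# subset() treats NA in the condition as FALSE
by_subset <- subset(toy, dep_time >= 0 & dep_time <= 600)

identical(by_which$id, by_subset$id)
```

Either form avoids building the explicit `& !is.na(...)` mask, at the cost of being a little less explicit about how missing values are handled.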

Upvotes: 3
