Reputation: 131
I am new to R and currently working through R for Data Science to teach myself some basics. I am working on the exercises in chapter 5.2.4, and while doing so I tried to write several versions of the same filtering code, which I assumed would be equivalent.
For this, I loaded the following packages:
library(nycflights13)
library(tidyverse)
I then wanted to select those flights from the included flights data set that departed between midnight and 6 am. I used the following code:
d1 <- filter(flights, dep_time >= 0 & dep_time <= 600) #yields 9344 rows
d2 <- flights[between(flights$dep_time, 0 , 600),] # yields 17599 rows
d3 <- filter(flights, between(dep_time,0,600)) #again yields 9344 rows
I cannot figure out why d2 is different from the others. Can anyone explain? Thank you for taking the time to answer such a basic question.
Upvotes: 3
Views: 62
Reputation: 886938
There are missing values in dep_time, so between() returns NA for those elements in addition to TRUE/FALSE. When a logical index passed to `[` contains NA, the corresponding row is returned as a row filled with NA. That is why we get more rows in d2. Counting only the TRUE elements gives the expected number:
sum(between(flights$dep_time, 0 , 600), na.rm = TRUE)
#[1] 9344
filter() takes the NA elements into account and removes them, keeping only the rows where the condition is TRUE.
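The difference between the two indexing behaviours can be seen on a small made-up data frame. This is only a sketch: the `df` values below are hypothetical and not taken from nycflights13, and base `subset()` is used as a stand-in because, like `filter()`, it treats NA in the condition as FALSE.

```r
# Toy data frame (hypothetical values, one missing dep_time)
df <- data.frame(id = 1:4, dep_time = c(500, NA, 700, 300))

# Logical index: TRUE, NA, FALSE, TRUE
idx <- df$dep_time >= 0 & df$dep_time <= 600

# `[` returns an all-NA row for the NA element of the index
by_bracket <- df[idx, ]
nrow(by_bracket)   # 3 rows, one of them filled with NA

# subset() drops the rows where the condition is NA
by_subset <- subset(df, idx)
nrow(by_subset)    # 2 rows
```

In the flights data the same thing happens on a larger scale: the extra rows in d2 (17599 - 9344 = 8255) are the rows where the index is NA.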
One option would be to treat the NA elements as FALSE:
i1 <- between(flights$dep_time, 0, 600) & !is.na(flights$dep_time)
d2 <- flights[i1,]
dim(d2)
#[1] 9344 19
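Another base R option, not part of the original answer, is to wrap the logical index in `which()`, which returns only the positions that are TRUE and so silently skips NA. A minimal sketch on a hypothetical toy data frame (the `df` values are made up for illustration):

```r
# Toy data frame (hypothetical values, one missing dep_time)
df <- data.frame(id = 1:4, dep_time = c(500, NA, 700, 300))

# Logical index: TRUE, NA, FALSE, TRUE
idx <- df$dep_time >= 0 & df$dep_time <= 600

# which() keeps only the positions where idx is TRUE (here 1 and 4)
kept <- df[which(idx), ]
nrow(kept)   # 2 rows, no NA row
```

Applied to the question, this would presumably be written as flights[which(between(flights$dep_time, 0, 600)), ].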
Upvotes: 3