Canned Man

Reputation: 766

Why do I get negative non-airtime in the nycflights13 dataset

I am working my way through R4DS and am currently on the exercises in section 5.5.2. I am asked to compare air_time with arr_time - dep_time, and began by running the following code:

# Setting up the packages
install.packages("tidyverse")
library(tidyverse)
install.packages("nycflights13")
library(nycflights13)

# View a tibble of the dataset
flights

# Creating a new object (lufttid = airtime) and viewing it
lufttid <- select(flights,
                  dep_time, arr_time, air_time, carrier, flight, time_hour)
lufttid

It would seem that air_time is in minutes, whilst dep_time and arr_time are formatted [h]hmm (clock time, e.g. 517 for 5:17). I tested this by creating lufttid.bedre (airtime.better):

lufttid.bedre <- mutate(lufttid,
                        dep_time.min = dep_time %/% 100 * 60 + dep_time %% 100,
                        arr_time.min = arr_time %/% 100 * 60 + arr_time %% 100,
                        flight_time.min = arr_time.min - dep_time.min,
                        non_air_time = flight_time.min - air_time)
lufttid.bedre
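
The same clock-to-minutes conversion can be wrapped in a small helper so that it is written only once. This is a minimal sketch; the name hhmm_to_min is made up here and is not part of any package:

```r
# Convert clock-style integers (e.g. 517 for 5:17) to minutes since midnight.
# hhmm_to_min is a hypothetical helper name, used for illustration only.
hhmm_to_min <- function(hhmm) {
  hhmm %/% 100 * 60 + hhmm %% 100
}

hhmm_to_min(c(517, 830, 2359))  # 317 510 1439
```

Note that a value of 2400 would map to 1440 rather than 0, so take the result modulo 1440 if midnight should count as 0.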

The numerous negative values in non_air_time suggested that air_time was also in the [h]hmm format, so I revised the data frame:

lufttid.bedre <- mutate(lufttid,
                        dep_time.min = dep_time %/% 100 * 60 + dep_time %% 100,
                        arr_time.min = arr_time %/% 100 * 60 + arr_time %% 100,
                        air_time.min = air_time %/% 100 * 60 + air_time %% 100,
                        flight_time.min = arr_time.min - dep_time.min,
                        non_air_time = flight_time.min - air_time.min)
lufttid.bedre

To my surprise, I still got negative values! Either I have done something bonkers, or there are some odd values in the dataset. Can anyone with a bit more insight explain where I’ve gone wrong? If I’ve done things right, it would suggest that there are some strange values in the dataset.

Note: I couldn’t think of any good tags to add to this, so if any editors have suggestions to improve my question, it would be very much welcomed.

Upvotes: 1

Views: 544

Answers (1)

agenis

Reputation: 8377

I did a little investigation of your result table (the first one seems correct to me; the air time should be in minutes):

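library(dplyr)    # for %>% and filter()
library(ggplot2)
# Histogram of the non-positive values; red lines mark whole-hour offsets
lufttid.bedre %>% filter(non_air_time <= 0, non_air_time > -500) %>%
  ggplot() + geom_histogram(aes(x = non_air_time), bins = 300) +
  geom_vline(xintercept = -60 * (1:6), col = "red")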


It appears that the modes of the distribution are around -50 min, -90 min and -160 min, and there is another mode at -1400. This strongly suggests that these are flights with little delay into other time zones at +01h00, +02h00, +03h00 and so on (with quite a peak around 24h00 as well). The help file ?flights explains that the data covers all flights that departed from NYC:

Description : On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.

And the description of the departure and arrival times mentions the local time zone:

dep_time,arr_time : Actual departure and arrival times, local tz.

That's enough proof for me to explain what you see. Now you can either account for the time zone of every observation, or just keep the flights that stay in the same time zone...
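
To make the time zone idea concrete, here is a rough sketch of how the elapsed time could be computed once the destination's offset from UTC is known. The function name, its arguments and the example flights are all made up for illustration, and the fixed offsets ignore daylight saving differences between regions:

```r
# Sketch only: elapsed_minutes and its arguments are hypothetical names.
# dep_hhmm, arr_hhmm: local clock times as integers (e.g. 900 for 9:00).
# dest_offset, origin_offset: hours from UTC (-5 assumed for New York).
elapsed_minutes <- function(dep_hhmm, arr_hhmm, dest_offset, origin_offset = -5) {
  dep_min <- dep_hhmm %/% 100 * 60 + dep_hhmm %% 100
  arr_min <- arr_hhmm %/% 100 * 60 + arr_hhmm %% 100
  # Express the arrival on the origin's clock, then wrap around midnight.
  tz_shift <- (dest_offset - origin_offset) * 60
  (arr_min - tz_shift - dep_min) %% 1440
}

# Made-up flights, not rows from the dataset:
elapsed_minutes(900, 1200, dest_offset = -8)  # 360: three time zones west
elapsed_minutes(2330, 45, dest_offset = -5)   # 75: overnight, same time zone
```

In nycflights13, the airports table has a tz column giving each airport's offset from UTC, which could be joined to flights (dest matched to faa) to supply dest_offset. Subtracting air_time from the result should then give far fewer negative values, though some residual oddities are to be expected.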

Upvotes: 2
