Reputation: 766
I am working my way through R4DS and am currently on Exercise 5.5.2. I am tasked with comparing air_time to arr_time - dep_time and began by running the following code:
# Setting up the packages
install.packages("tidyverse")
library(tidyverse)
install.packages("nycflights13")
library(nycflights13)
# View a tibble of the dataset
flights
# Creating a new object (lufttid = airtime) and viewing it
lufttid <- select(flights,
dep_time, arr_time, air_time, carrier, flight, time_hour)
lufttid
It would seem that air_time is in minutes, whilst dep_time and arr_time are formatted as clock times, [h]hmm (for example, 517 means 05:17). I tested this by creating lufttid.bedre (airtime.better):
lufttid.bedre <- mutate(lufttid,
dep_time.min = dep_time %/% 100 * 60 + dep_time %% 100,
arr_time.min = arr_time %/% 100 * 60 + arr_time %% 100,
flight_time.min = arr_time.min - dep_time.min,
non_air_time = flight_time.min - air_time)
lufttid.bedre
The numerous negative values I got in non_air_time suggested that air_time was also in that clock format rather than in minutes, so I changed the data frame accordingly:
lufttid.bedre <- mutate(lufttid,
dep_time.min = dep_time %/% 100 * 60 + dep_time %% 100,
arr_time.min = arr_time %/% 100 * 60 + arr_time %% 100,
air_time.min = air_time %/% 100 * 60 + air_time %% 100,
flight_time.min = arr_time.min - dep_time.min,
non_air_time = flight_time.min - air_time.min)
lufttid.bedre
To my surprise, I still got negative values! Either I have done something bonkers, or there are some odd values in the dataset. Can anyone with a bit more insight explain where I’ve gone wrong? If I’ve done things right, it would suggest that there are some strange values in the dataset.
Note: I couldn’t think of any good tags to add to this, so if any editors have suggestions to improve my question, it would be very much welcomed.
Upvotes: 1
Views: 544
Reputation: 8377
I did a little investigation of your result table (your first version seems correct to me: air_time is already in minutes, so it should not be converted):
library(dplyr)
library(ggplot2)
# Histogram of the negative values, with red lines at whole-hour offsets
lufttid.bedre %>%
  filter(non_air_time <= 0, non_air_time > -500) %>%
  ggplot() +
  geom_histogram(aes(x = non_air_time), bins = 300) +
  geom_vline(xintercept = -60 * (1:6), col = "red")
The modes of the distribution appear to be around -50 min, -90 min and -160 min, and there is another mode at about -1400 min. This strongly suggests that these are flights with little delay to time zones offset by one, two, three or more hours from New York (with quite a peak around 24 hours as well).
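To make the arithmetic behind those modes concrete, here is a minimal base R sketch. The three flights are hypothetical values chosen for illustration, not rows from the dataset, and `hhmm_to_min` is just a helper name for the same conversion used in the question.

```r
# Convert an HHMM clock reading (e.g. 517 = 05:17) to minutes since midnight
hhmm_to_min <- function(x) x %/% 100 * 60 + x %% 100

# Hypothetical flights (illustrative values only, not taken from nycflights13)
toy <- data.frame(
  case     = c("same zone", "3 h west", "past midnight"),
  dep_time = c(900, 900, 2330),   # local clock time at the origin
  arr_time = c(1030, 1130, 45),   # local clock time at the destination
  air_time = c(75, 320, 60)       # minutes in the air
)

# Same calculation as in the question: clock difference minus air time
toy$flight_time.min <- hhmm_to_min(toy$arr_time) - hhmm_to_min(toy$dep_time)
toy$non_air_time    <- toy$flight_time.min - toy$air_time
toy
```

Only the first flight gives a plausible positive value. The second is short by roughly the three-hour clock difference, and the third lands near -1400 because the arrival clock has wrapped past midnight.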
The help file ?flights explains that the data covers all flights that departed from NYC:
Description : On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.
And on the meaning of the departure and arrival times, it mentions the local time zone:
dep_time,arr_time : Actual departure and arrival times, local tz.
That's enough proof for me to explain what you see. What you can do now is start accounting for the time zone of every flight, or just filter for the flights that stay in the same time zone...
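One possible way to account for the time zone is sketched below in base R. It assumes you already know, for each flight, the difference in UTC offset between destination and origin in hours; `gate_to_gate_min` and `tz_diff_h` are hypothetical names, and the inputs are made-up examples. It ignores daylight saving differences and any trip longer than 24 hours.

```r
# Convert an HHMM clock reading to minutes since midnight
hhmm_to_min <- function(x) x %/% 100 * 60 + x %% 100

# Elapsed minutes between departure and arrival.
# tz_diff_h (assumed known): destination UTC offset minus origin UTC offset, in hours
gate_to_gate_min <- function(dep_time, arr_time, tz_diff_h = 0) {
  elapsed <- hhmm_to_min(arr_time) - hhmm_to_min(dep_time) - tz_diff_h * 60
  elapsed %% 1440  # wrap arrivals that fall after midnight back into 0..1439
}

# Hypothetical flights: same zone, three hours west, and arriving after midnight
elapsed <- gate_to_gate_min(
  dep_time  = c(900, 900, 2330),
  arr_time  = c(1030, 1130, 45),
  tz_diff_h = c(0, -3, 0)
)
elapsed
```

With the real data, the offset would have to come from somewhere; the airports table in nycflights13 has a tz column that could probably be joined on the destination code, but check ?airports before relying on it.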
Upvotes: 2