Reputation: 1486
I'm currently working through R for Data Science, and specifically working on exercise 5.7.1 #8 which is analyzing the library(nycflights13)
package data.
The question reads:
- For each plane, count the number of flights before the first delay of greater than 1 hour.
My attempt was to create a table that finds the first "over 60 minute" delay by using the first()
function:
first_del <- flights %>%
select(month, day, flight, dep_time, tailnum, dep_delay) %>%
filter(dep_delay > 60) %>%
group_by(month,day) %>%
arrange(month, day, dep_time) %>%
summarise(flight = first(flight), first_time = first(dep_time))
first_del
# A tibble: 365 x 4
# Groups: month [?]
month day flight first_time
<int> <int> <int> <int>
1 1 1 4576 811
2 1 2 22 126
3 1 3 104 50
4 1 4 608 106
5 1 5 11 37
6 1 6 27 746
7 1 7 145 756
8 1 8 4334 740
9 1 9 51 641
10 1 10 905 743
# ... with 355 more rows
My idea is to tag each row in the flights
tibble 1 if if it matches the month, day, and is less than the flight number of the first delayed flight on that day (for example, from the first_del
tibble above, flight 4576 is the first "over 60 minute delayed" flight on Jan 1, and every other flight before it will count). The desired output would be something like:
flights %>%
filter(dep_time > 805) %>%
select(month, day, flight, dep_time, tag)
# A tibble: 272,933 x 4
month day flight dep_time tag
<int> <int> <int> <int> <int>
1 1 1 269 807 1
2 1 1 4388 809 1
3 1 1 3538 810 1
4 1 1 2395 810 1
5 1 1 4260 811 1
6 1 1 4576 811 1
7 1 1 675 811 0
8 1 1 4537 812 0
9 1 1 914 813 0
10 1 1 346 814 0
Ideally, it would be great to tally all rows less than or equal to the flight number on each day according to the first_del
tibble. I've tried to use many combinations of filter, %in%, mutate, but have not yet been successful. Should I create a custom function?
My ultimate desired output is (with fictitious $count
values):
first_del
# A tibble: 365 x 4
# Groups: month [?]
month day flight first_time count
<int> <int> <int> <int> <int>
1 1 1 4576 811 212
2 1 2 22 126 216
3 1 3 104 50 298
4 1 4 608 106 220
5 1 5 11 37 168
6 1 6 27 746 287
7 1 7 145 756 302
8 1 8 4334 740 246
9 1 9 51 641 235
10 1 10 905 743 313
where $count
is the number of flights that preceded the first delayed flight on that day (as wanted by the question in the links above).
Upvotes: 1
Views: 791
Reputation: 43354
The question is per plane, so you really want to operate grouped by tailnum
. You can add a flag column, but really you need to end up with something you can pass to filter
(a logical vector) or slice
(a vector of row indices). There are various ways to do this, e.g. slice(seq(c(which(dep_delay > 60) - 1, n())[1]))
, but a nice approach is to use dplyr's cumall
(a cumulative version of all
, like cumsum
is to sum
) to generate a logical vector for filter
:
library(dplyr)
nycflights13::flights %>%
group_by(tailnum) %>%
arrange(year, month, day, dep_time) %>% # ensure order before cumany
filter(cumall(pmax(dep_delay, arr_delay) < 60)) %>%
tally() # count number of observations per group (tailnum)
#> # A tibble: 3,709 x 2
#> # Groups: tailnum [3,709]
#> tailnum n
#> <chr> <int>
#> 1 N10156 9
#> 2 N102UW 25
#> 3 N103US 46
#> 4 N104UW 3
#> 5 N105UW 22
#> 6 N107US 20
#> 7 N108UW 36
#> 8 N109UW 28
#> 9 N110UW 15
#> 10 N11107 7
#> # ... with 3,699 more rows
It's possible to make a intermediate table of first big delays and do a self-join to ID where they are, or add a flag value to some observations with if_else
, but regardless, subsetting up to those rows will still require similar logic to the cumall
above, so they're really just longer, slower approaches.
Upvotes: 0
Reputation: 16099
You can use which.max
on a logical vector to determine the first instance satisfying a condition. You also need to check the condition actually occurs.
library(dplyr)
library(nycflights13)
flights %>%
mutate(dep_delay = coalesce(dep_delay, 0)) %>%
arrange(month, day, dep_time) %>%
group_by(tailnum) %>%
summarise(max_delay = max(dep_delay),
which_first_geq_1hr = which.max(dep_delay > 60)) %>%
ungroup %>%
filter(max_delay > 60)
I'm assuming that delay means departure delay, NA
delay means 0
or at least less than an hour, and I'm ignoring planes that 'failed' to be delayed by more than hour. The coalesce
is necessary to avoid which.max(NA)
.
Upvotes: 0