Reputation: 1610
Consider the flights data:
library(nycflights13)
flights
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228
2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211
3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA
4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB
5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN
6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463
7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB
8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS
9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB
10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA
# … with 336,766 more rows, and 7 more variables: origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
I'm convinced that many of the tailnum entries are duplicated. anyDuplicated would confirm this, but I want to see the duplicates side-by-side. The best that I could come up with is:
flights[duplicated(flights$tailnum),]->dups
dups[order(dups$tailnum),]
# A tibble: 332,732 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
1 2013 3 23 1340 1300 40 1638 1554 44 DL 1685 D942DN
2 2013 3 24 859 835 24 1142 1140 2 DL 1959 D942DN
3 2013 7 5 1253 1259 -6 1518 1529 -11 DL 781 D942DN
4 2013 1 1 2100 2100 0 2307 2250 17 MQ 4584 N0EGMQ
5 2013 1 2 827 835 -8 1059 1105 -6 MQ 4610 N0EGMQ
6 2013 1 2 2014 2020 -6 2256 2245 11 MQ 4662 N0EGMQ
7 2013 1 4 1621 1625 -4 1853 1855 -2 MQ 4661 N0EGMQ
8 2013 1 5 834 835 -1 1050 1105 -15 MQ 4610 N0EGMQ
9 2013 1 6 832 835 -3 1101 1105 -4 MQ 4610 N0EGMQ
10 2013 1 6 2051 2100 -9 2241 2250 -9 MQ 4584 N0EGMQ
# … with 332,722 more rows, and 7 more variables: origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
This gives the desired output, but I find the input extremely ugly. Is there a better way? It seems like a natural use case for piping, but given how long base R has gone without that feature, I expected to find a good alternative. I was hoping to compose order and duplicated somehow, but no method has occurred to me.
Upvotes: 1
Views: 39
Reputation: 72623
You can try this.
r2 <- with(flights, flights[o <- order(r <- rank(tailnum, na.last='keep')), ][duplicated(r[o]), ])
Here r1 is the result of your original approach, and the check confirms that both give the same result:
flights[duplicated(flights$tailnum),]->dups
r1 <- dups[order(dups$tailnum),]
stopifnot(all.equal(r1, r2))
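For anyone unpicking the one-liner, here is a rough sketch of the same steps on a small made-up data frame (toy and its columns are hypothetical stand-ins, not part of nycflights13). It sorts first and then drops the first row of each tailnum group, which should match filtering first and sorting afterwards because order() keeps tied rows in their original order.

```r
# Hypothetical toy data standing in for flights
toy <- data.frame(
  tailnum = c("N2", "N1", "N2", NA, "N1", "N3", "N2", NA),
  flight  = 1:8,
  stringsAsFactors = FALSE
)

# Step 1: rank the keys; equal tailnums share a rank, NA stays NA
r <- rank(toy$tailnum, na.last = "keep")
# Step 2: ordering by rank sorts by tailnum (NAs last, ties in original order)
o <- order(r)
sorted <- toy[o, ]
# Step 3: on the sorted ranks, duplicated() flags every row except the
# first of each group
by_sort_first <- sorted[duplicated(r[o]), ]

# The approach from the question: filter first, then sort
dups <- toy[duplicated(toy$tailnum), ]
by_filter_first <- dups[order(dups$tailnum), ]

# Both routes should select the same rows in the same order
stopifnot(isTRUE(all.equal(by_sort_first, by_filter_first)))
```

Note that either route leaves out the first occurrence of each tailnum, since duplicated() only flags the repeats.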
Upvotes: 0
Reputation: 886948
In tidyverse, this can be piped:
library(dplyr)
flights %>%
  filter(duplicated(tailnum)) %>%
  arrange(tailnum)
In base R, the issue is that without a pipe (i.e. an external package) or assignment to an intermediate object, we may have to compromise on efficiency by calling duplicated twice:
subset(flights, duplicated(tailnum))[
with(flights, order(tailnum[duplicated(tailnum)])),]
Or in combination with magrittr:
subset(flights, duplicated(tailnum)) %>%
.[order(.$tailnum), ]
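If R 4.1.0 or later is available, the native pipe |> may avoid the need for magrittr altogether. The following is only a sketch on a made-up data frame (toy is hypothetical, used so the example runs without nycflights13); the anonymous-function step is one possible way to reuse the piped value twice, since |> has no dot placeholder like magrittr's.

```r
# Requires R >= 4.1.0 for the native pipe |> and the \(x) lambda shorthand
# Hypothetical toy data standing in for flights
toy <- data.frame(
  tailnum = c("N2", "N1", "N2", "N1", "N3", "N2"),
  flight  = 1:6
)

sorted_dups <- toy |>
  subset(duplicated(tailnum)) |>    # keep the repeated tail numbers
  (\(d) d[order(d$tailnum), ])()    # sort the remaining rows by tailnum

sorted_dups
```

With the real data, replacing toy with flights should give the same rows as the magrittr version, with duplicated() called only once.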
Upvotes: 1