J. Mini

Reputation: 1610

Sort a data frame by its duplicated rows in base R?

Consider the flights data:

library(nycflights13)
flights
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>  
 1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228 
 2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211 
 3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA 
 4  2013     1     1      544            545        -1     1004           1022       -18 B6         725 N804JB 
 5  2013     1     1      554            600        -6      812            837       -25 DL         461 N668DN 
 6  2013     1     1      554            558        -4      740            728        12 UA        1696 N39463 
 7  2013     1     1      555            600        -5      913            854        19 B6         507 N516JB 
 8  2013     1     1      557            600        -3      709            723       -14 EV        5708 N829AS 
 9  2013     1     1      557            600        -3      838            846        -8 B6          79 N593JB 
10  2013     1     1      558            600        -2      753            745         8 AA         301 N3ALAA 
# … with 336,766 more rows, and 7 more variables: origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

I'm convinced that many of the tailnum entries are duplicated. anyDuplicated would confirm this, but I want to see the duplicates side-by-side. The best that I could come up with is:

flights[duplicated(flights$tailnum),]->dups
dups[order(dups$tailnum),]
# A tibble: 332,732 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>  
 1  2013     3    23     1340           1300        40     1638           1554        44 DL        1685 D942DN 
 2  2013     3    24      859            835        24     1142           1140         2 DL        1959 D942DN 
 3  2013     7     5     1253           1259        -6     1518           1529       -11 DL         781 D942DN 
 4  2013     1     1     2100           2100         0     2307           2250        17 MQ        4584 N0EGMQ 
 5  2013     1     2      827            835        -8     1059           1105        -6 MQ        4610 N0EGMQ 
 6  2013     1     2     2014           2020        -6     2256           2245        11 MQ        4662 N0EGMQ 
 7  2013     1     4     1621           1625        -4     1853           1855        -2 MQ        4661 N0EGMQ 
 8  2013     1     5      834            835        -1     1050           1105       -15 MQ        4610 N0EGMQ 
 9  2013     1     6      832            835        -3     1101           1105        -4 MQ        4610 N0EGMQ 
10  2013     1     6     2051           2100        -9     2241           2250        -9 MQ        4584 N0EGMQ 
# … with 332,722 more rows, and 7 more variables: origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

This gives the desired output, but I find the input extremely ugly. Is there a better way? It seems like a natural use case for piping, but given how long base R has gone without that feature, I expected to find a good alternative. I was hoping to compose order and duplicated somehow, but no method has occurred to me.

Upvotes: 1

Views: 39

Answers (2)

jay.sf

Reputation: 72623

You can try this.

r2 <- with(flights, flights[o <- order(r <- rank(tailnum, na.last='keep')), ][duplicated(r[o]), ])

This gives the same result as your two-step approach:

flights[duplicated(flights$tailnum),]->dups
r1 <- dups[order(dups$tailnum),]
stopifnot(all.equal(r1, r2))
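
Note that duplicated() flags only the second and later occurrences of a value, so both versions above drop the first row of every tailnum group. If the goal is to see every row that shares a tailnum side by side, one possible sketch combines a forward pass with a backward pass (fromLast = TRUE). The toy data frame df and its values are made up for illustration and only stand in for flights.

```r
# Toy stand-in for flights (hypothetical data, for illustration only)
df <- data.frame(
  tailnum = c("N2", "N1", "N2", "N3", "N1", "N2"),
  flight  = c(10L, 20L, 30L, 40L, 50L, 60L),
  stringsAsFactors = FALSE
)

# TRUE for every row whose tailnum occurs more than once,
# including the first occurrence that duplicated() alone would skip
in_dup_group <- duplicated(df$tailnum) | duplicated(df$tailnum, fromLast = TRUE)

# Keep those rows, then sort so equal tailnums sit next to each other
all_dups <- df[in_dup_group, ]
all_dups <- all_dups[order(all_dups$tailnum), ]
all_dups
```

Here the single N3 row is dropped, while all N1 and N2 rows are kept and grouped together.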

Upvotes: 0

akrun

Reputation: 886948

With the tidyverse, this can be piped:

library(dplyr)
flights %>%
    filter(duplicated(tailnum)) %>% 
    arrange(tailnum)

In base R, however, without a pipe (i.e. an external package) or assignment to an intermediate object, you may have to sacrifice some efficiency by calling duplicated twice:

subset(flights, duplicated(tailnum))[
     with(flights, order(tailnum[duplicated(tailnum)])),]
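
If the second call to duplicated is a concern, one base R option is to wrap the two steps in a small function, so the intermediate object exists only inside the function body. This is just a sketch: sorted_dups is a hypothetical helper name, and the toy data frame df stands in for flights.

```r
# Hypothetical helper: keep rows whose `col` value was already seen earlier,
# then sort those rows by `col`
sorted_dups <- function(data, col) {
  dups <- data[duplicated(data[[col]]), , drop = FALSE]  # duplicated() runs once
  dups[order(dups[[col]]), , drop = FALSE]
}

# Toy stand-in for flights (hypothetical data, for illustration only)
df <- data.frame(
  tailnum = c("N2", "N1", "N2", "N3", "N1", "N2"),
  flight  = c(10L, 20L, 30L, 40L, 50L, 60L),
  stringsAsFactors = FALSE
)

res <- sorted_dups(df, "tailnum")
res
```

With the real data the call would be sorted_dups(flights, "tailnum").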

Or, in combination with magrittr:

subset(flights, duplicated(tailnum)) %>%
       .[order(.$tailnum), ] 
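
Since R 4.1.0, base R also has a native pipe, |>, and a shorthand \(x) for anonymous functions, so on a recent enough R version the same chain can be written without loading any package. The sketch below assumes R >= 4.1 and uses a made-up toy data frame df in place of flights.

```r
# Toy stand-in for flights (hypothetical data, for illustration only)
df <- data.frame(
  tailnum = c("N2", "N1", "N2", "N3", "N1", "N2"),
  flight  = c(10L, 20L, 30L, 40L, 50L, 60L),
  stringsAsFactors = FALSE
)

# Native pipe (requires R >= 4.1): filter to repeated tailnums, then sort.
# The native pipe has no `.` placeholder, so the sort step is an
# anonymous function that is called immediately.
res <- df |>
  subset(duplicated(tailnum)) |>
  (\(d) d[order(d$tailnum), ])()
res
```

The anonymous function plays the role of the . placeholder in the magrittr version.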

Upvotes: 1
