A. Beal
A. Beal

Reputation: 93

What's an efficient method to extract only the rows with the first occurrence from a data set?

I have a data frame with patient encounters, and want to extract only the oldest encounter for each patient (which can be done using the sequential encounter ID). The code I came up with works, but I'm sure there are more efficient ways to perform this task using dplyr. What approach would you recommend?

Example with 10 encounters for 4 patients:

encounter_ID <- c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077)
patient_ID <- c(855,721,821,855,423,423,855,721,423,855)
gender <- c(0,0,1,0,1,1,0,0,1,0)
df <- data.frame(encounter_ID, patient_ID, gender)

Result (desired and obtained):

    encounter_ID    patient_ID  gender
    1003            855         0
    1022            721         0
    1013            821         1
    1002            423         1

My approach

1) Extract a list of the unique patients

list.patients <- unique(df$patient_ID)

2) Create an empty data frame to receive our output of the first encounter per patient

one.encounter <- data.frame()

3) Go through each patient on the list to extract their first encounter and populate our data frame

for (i in 1:length(list.patients)) {
one.patient <- df %>% filter(patient_ID==list.patients[i])
one.patient.ordered <- one.patient[order(one.patient$encounter_ID),]
first.encounter <- head(one.patient.ordered, n=1)
one.encounter <- rbind(one.encounter, first.encounter)
} 

Upvotes: 4

Views: 179

Answers (7)

markus
markus

Reputation: 26343

Since OP asked for an efficient method in terms of execution time, here is a benchmark of the answers in addition to a data.table way.

enter image description here

#Unit: milliseconds
#            expr        min         lq       mean     median         uq        max neval
#          OP(df) 1354.49200 1398.15245 1481.16068 1467.31151 1531.93056 2124.05586   100
#        Mike(df)  587.33074  606.33194  649.87766  621.65719  658.96548 1076.12302   100
#   Fernandes(df)  177.80735  182.97910  206.64074  185.91444  198.83281  430.96393   100
#       `5th`(df)   60.55170   64.98082   77.55248   67.73171   71.54677  208.47656   100
#       SmitM(df)   52.70000   53.93696   59.05506   54.84035   58.92260  175.24284   100
#   Jan_Boyer(df)   30.70666   33.44665   43.04396   34.46983   35.69736  223.02998   100
#  data_table(df)   11.51547   12.38410   14.60907   13.08038   15.25540   43.71229   100
# Moody_dplyr(df)  234.08792  241.02003  260.19283  245.20301  259.82435  517.03117   100
# Moody_baseR(df)   67.05192   72.00578   89.50914   74.64688   77.58169  299.56125   100

code and data

library(microbenchmark)
library(tidyverse)
library(data.table)

n <- 1e6
set.seed(1)
df <- data.frame(encounter_ID = sample(1000:1999, size = n, replace = TRUE), 
                 patient_ID = sample(700:900, n, TRUE), 
                 gender = sample(0:1, n, TRUE))

benchmark <- microbenchmark(
  OP(df),
  Mike(df),
  Fernandes(df),
  `5th`(df),
  SmitM(df),
  Jan_Boyer(df),
  data_table(df),
  Moody_dplyr(df),
  Moody_baseR(df)
)

autoplot(benchmark)

The solutions so far.

Mike <- function(df) {
  df %>%  
    arrange(patient_ID, encounter_ID) %>% 
    group_by(patient_ID) %>% 
    filter(row_number()==1)
}

SmitM <- function(df) {
  df %>% 
    group_by(patient_ID, gender) %>% 
    summarise(encounter_ID = min(encounter_ID))
}

Fernandes <- function(df) {
  x <- dplyr::arrange(df, encounter_ID)
  x[!duplicated(x$patient_ID),]
}

`5th` <- function(df) {
  df_ordered <- df[order(df$patient_ID, df$encounter_ID), ]
  df_ordered[match(unique(df_ordered$patient_ID), df_ordered$patient_ID), ]
}

Jan_Boyer <- function(df) {
  df <- df[order(df$encounter_ID),] 
  df[!duplicated(df$patient_ID),]
}

data_table <- function(df) {
  setDT(df, key = 'encounter_ID')
  df[df[, .I[1], by = patient_ID]$V1]
}

OP <- function(df) {
  list.patients <- unique(df$patient_ID)
  one.encounter <- data.frame()

  for (i in 1:length(list.patients)) {
    one.patient <- df %>% filter(patient_ID == list.patients[i])
    one.patient.ordered <- one.patient[order(one.patient$encounter_ID), ]
    first.encounter <- head(one.patient.ordered, n = 1)
    one.encounter <- rbind(one.encounter, first.encounter)
  } 
}

Moody_dplyr <- function(df) {
  df %>% group_by(patient_ID) %>% top_n(-1,encounter_ID)
}

Moody_baseR <- function(df) {
  subset(df, as.logical(ave(encounter_ID, patient_ID, FUN = function(x) x == min(x))))
}

Upvotes: 4

moodymudskipper
moodymudskipper

Reputation: 47300

You could use top_n :

library(dplyr)
df %>% group_by(patient_ID) %>% top_n(-1,encounter_ID)
# # A tibble: 4 x 3
# # Groups:   patient_ID [4]
#   encounter_ID patient_ID gender
#          <dbl>      <dbl>  <dbl>
# 1         1022        721      0
# 2         1013        821      1
# 3         1002        423      1
# 4         1003        855      0

It's not super fast but it's the idomatic dplyr way.

With base R this is much faster:

subset(df, as.logical(ave(encounter_ID, patient_ID, FUN = function(x) x == min(x))))

Upvotes: 1

5th
5th

Reputation: 2375

Generally R works fastest if you vectorize operations. Therefore the question is what you mean when you ask for more efficient ways to solve this?

To illustrate this I show you a solution in base R and run a microbenchmark:

microbenchmark::microbenchmark(myfun1(),myfun2(),myfun3())
Unit: microseconds
     expr    min      lq     mean  median     uq     max neval
 myfun1() 3997.1 4416.10 6086.848 5129.65 6215.6 64014.4   100
 myfun2()  834.7  993.50 1404.901 1083.95 1247.5 20456.2   100
 myfun3()  133.3  162.75  258.533  193.75  233.8  3561.7   100

Your solution is myfun1(), @SmitM dplyr-version is myfun2() and my solution (myfun3) looks like this:

df_ordered=df[order(df$patient_ID,df$encounter_ID),]
df_ordered[match(unique(df_ordered$patient_ID),df_ordered$patient_ID),]

Now you can choose what you like most: dplyr solutions are very nice to read and I think also can be exported into other programming languages. The base R solutions are very fast, but usually not as nice to read and to the best of my knowledge can't be exported into other languages.

I posted the base R-version here because it is relatively nice to read, because every function does what it is called like - still dplyr looks nicer though.

Upvotes: 3

Jan Boyer
Jan Boyer

Reputation: 1580

Here is a base R solution, it is possible to do this efficiently without dplyr

duplicated will code the first row it encounters with a certain patient id as FALSE, and all subsequent rows with that same patient id as TRUE (Here, I've reversed that by adding ! before duplicated), so you can use it to select only the first encounter if you've ordered your dataframe by encounter_ID

df <- df[order(df$encounter_ID),] #order dataframe by encounter id
#subset to rows that are not duplicates of a previous encounter for that patient
first <- df[!duplicated(df$patient_ID),]

Upvotes: 4

Thiago Fernandes
Thiago Fernandes

Reputation: 273

Another option

x = dplyr::arrange(df, encounter_ID)
x[!duplicated(x$patient_ID),]
#  encounter_ID patient_ID gender
#1         1002        423      1
#2         1003        855      0
#4         1013        821      1
#6         1022        721      0

Upvotes: 1

SmitM
SmitM

Reputation: 1376

You can try:

df2 <- df %>% 
          group_by(patient_ID, gender) %>% 
          summarise(encounter_ID = min(encounter_ID))

Upvotes: 3

Mike
Mike

Reputation: 4370

something like this In the dplyr code below I would sort by the two ids then group by patient. Using row_numer()==1 in the filter statement will grab the smallest encouter_id per patient because you sorted by both variables and group_by patient_id. :

encounter_ID <- c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077)
patient_ID <- c(855,721,821,855,423,423,855,721,423,855)
gender <- c(0,0,1,0,1,1,0,0,1,0)
df <- data.frame(encounter_ID, patient_ID, gender)

library(dplyr)



df2 <- df %>%  
        arrange(patient_ID, encounter_ID) %>% 
        group_by(patient_ID) %>% 
        filter(row_number()==1)

Upvotes: 1

Related Questions