Reputation: 93
I have a data frame with patient encounters, and want to extract only the oldest encounter for each patient (which can be done using the sequential encounter ID). The code I came up with works, but I'm sure there are more efficient ways to perform this task using dplyr. What approach would you recommend?
Example with 10 encounters for 4 patients:
encounter_ID <- c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077)
patient_ID <- c(855,721,821,855,423,423,855,721,423,855)
gender <- c(0,0,1,0,1,1,0,0,1,0)
df <- data.frame(encounter_ID, patient_ID, gender)
Result (desired and obtained):
encounter_ID patient_ID gender
1003 855 0
1022 721 0
1013 821 1
1002 423 1
My approach
1) Extract a list of the unique patients
list.patients <- unique(df$patient_ID)
2) Create an empty data frame to receive our output of the first encounter per patient
one.encounter <- data.frame()
3) Go through each patient on the list to extract their first encounter and populate our data frame
for (i in 1:length(list.patients)) {
one.patient <- df %>% filter(patient_ID==list.patients[i])
one.patient.ordered <- one.patient[order(one.patient$encounter_ID),]
first.encounter <- head(one.patient.ordered, n=1)
one.encounter <- rbind(one.encounter, first.encounter)
}
Upvotes: 4
Views: 179
Reputation: 26343
Since OP asked for an efficient method in terms of execution time, here is a benchmark of the answers, together with a data.table approach.
#Unit: milliseconds
# expr min lq mean median uq max neval
# OP(df) 1354.49200 1398.15245 1481.16068 1467.31151 1531.93056 2124.05586 100
# Mike(df) 587.33074 606.33194 649.87766 621.65719 658.96548 1076.12302 100
# Fernandes(df) 177.80735 182.97910 206.64074 185.91444 198.83281 430.96393 100
# `5th`(df) 60.55170 64.98082 77.55248 67.73171 71.54677 208.47656 100
# SmitM(df) 52.70000 53.93696 59.05506 54.84035 58.92260 175.24284 100
# Jan_Boyer(df) 30.70666 33.44665 43.04396 34.46983 35.69736 223.02998 100
# data_table(df) 11.51547 12.38410 14.60907 13.08038 15.25540 43.71229 100
# Moody_dplyr(df) 234.08792 241.02003 260.19283 245.20301 259.82435 517.03117 100
# Moody_baseR(df) 67.05192 72.00578 89.50914 74.64688 77.58169 299.56125 100
Code and data:
library(microbenchmark)
library(tidyverse)
library(data.table)
n <- 1e6
set.seed(1)
df <- data.frame(encounter_ID = sample(1000:1999, size = n, replace = TRUE),
patient_ID = sample(700:900, n, TRUE),
gender = sample(0:1, n, TRUE))
benchmark <- microbenchmark(
OP(df),
Mike(df),
Fernandes(df),
`5th`(df),
SmitM(df),
Jan_Boyer(df),
data_table(df),
Moody_dplyr(df),
Moody_baseR(df)
)
autoplot(benchmark)
The solutions so far.
Mike <- function(df) {
df %>%
arrange(patient_ID, encounter_ID) %>%
group_by(patient_ID) %>%
filter(row_number()==1)
}
SmitM <- function(df) {
df %>%
group_by(patient_ID, gender) %>%
summarise(encounter_ID = min(encounter_ID))
}
Fernandes <- function(df) {
x <- dplyr::arrange(df, encounter_ID)
x[!duplicated(x$patient_ID),]
}
`5th` <- function(df) {
df_ordered <- df[order(df$patient_ID, df$encounter_ID), ]
df_ordered[match(unique(df_ordered$patient_ID), df_ordered$patient_ID), ]
}
Jan_Boyer <- function(df) {
df <- df[order(df$encounter_ID),]
df[!duplicated(df$patient_ID),]
}
data_table <- function(df) {
setDT(df, key = 'encounter_ID')
df[df[, .I[1], by = patient_ID]$V1]
}
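One caveat worth noting (my observation, not part of the original benchmark): setDT converts its input to a data.table by reference and keys it, so the first data_table(df) call mutates the shared df. A sketch of a side-effect-free variant, using as.data.table to copy first:

```r
library(data.table)

data_table_copy <- function(df) {
  dt <- as.data.table(df)              # copies, leaving the caller's df untouched
  setkey(dt, encounter_ID)             # sort by encounter_ID, as the keyed version does
  dt[dt[, .I[1], by = patient_ID]$V1]  # first row (smallest encounter_ID) per patient
}
```

This trades one copy per call for reproducible timings across benchmark iterations.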
OP <- function(df) {
list.patients <- unique(df$patient_ID)
one.encounter <- data.frame()
for (i in 1:length(list.patients)) {
one.patient <- df %>% filter(patient_ID == list.patients[i])
one.patient.ordered <- one.patient[order(one.patient$encounter_ID), ]
first.encounter <- head(one.patient.ordered, n = 1)
one.encounter <- rbind(one.encounter, first.encounter)
}
one.encounter # return the accumulated result, so all functions are comparable
}
Moody_dplyr <- function(df) {
df %>% group_by(patient_ID) %>% top_n(-1,encounter_ID)
}
Moody_baseR <- function(df) {
subset(df, as.logical(ave(encounter_ID, patient_ID, FUN = function(x) x == min(x))))
}
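Before trusting the timings, it is also worth confirming that the candidates return the same rows. A minimal sketch of such a check, restating two of the solutions so the snippet is self-contained (normalise and identical_rows are names made up here; .groups = "drop" assumes dplyr >= 1.0):

```r
library(dplyr)

jan_boyer <- function(df) {
  df <- df[order(df$encounter_ID), ]
  df[!duplicated(df$patient_ID), ]
}
smit_m <- function(df) {
  df %>%
    group_by(patient_ID, gender) %>%
    summarise(encounter_ID = min(encounter_ID), .groups = "drop")
}

df <- data.frame(encounter_ID = c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077),
                 patient_ID   = c(855, 721, 821, 855, 423, 423, 855, 721, 423, 855),
                 gender       = c(0, 0, 1, 0, 1, 1, 0, 0, 1, 0))

# Put rows and columns in a canonical order before comparing
normalise <- function(x) {
  x <- as.data.frame(x)[, c("encounter_ID", "patient_ID", "gender")]
  x[order(x$patient_ID), , drop = FALSE]
}
identical_rows <- all(normalise(jan_boyer(df)) == normalise(smit_m(df)))
```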
Upvotes: 4
Reputation: 47300
You could use top_n:
library(dplyr)
df %>% group_by(patient_ID) %>% top_n(-1,encounter_ID)
# # A tibble: 4 x 3
# # Groups: patient_ID [4]
# encounter_ID patient_ID gender
# <dbl> <dbl> <dbl>
# 1 1022 721 0
# 2 1013 821 1
# 3 1002 423 1
# 4 1003 855 0
It's not super fast, but it's the idiomatic dplyr way. With base R this is much faster:
subset(df, as.logical(ave(encounter_ID, patient_ID, FUN = function(x) x == min(x))))
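To unpack that one-liner: ave applies FUN within each patient_ID group and returns a vector aligned with the original rows, so comparing each encounter_ID with its group minimum yields a logical mask for subset. A small illustration on hand-picked values:

```r
encounter_ID <- c(1021, 1022, 1013, 1041)
patient_ID   <- c(855, 721, 821, 855)

# For each row: is this encounter_ID the minimum within its patient_ID group?
mask <- as.logical(ave(encounter_ID, patient_ID, FUN = function(x) x == min(x)))
mask  # TRUE TRUE TRUE FALSE: row 4 (1041) is not patient 855's oldest encounter
```

The as.logical is needed because ave coerces FUN's logical result back to the numeric type of its first argument.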
Upvotes: 1
Reputation: 2375
Generally, R works fastest when you vectorize operations, so the question is what you mean by more efficient. To illustrate, here is a solution in base R along with a microbenchmark:
microbenchmark::microbenchmark(myfun1(),myfun2(),myfun3())
Unit: microseconds
expr min lq mean median uq max neval
myfun1() 3997.1 4416.10 6086.848 5129.65 6215.6 64014.4 100
myfun2() 834.7 993.50 1404.901 1083.95 1247.5 20456.2 100
myfun3() 133.3 162.75 258.533 193.75 233.8 3561.7 100
Your solution is myfun1(), @SmitM's dplyr version is myfun2(), and my solution (myfun3) looks like this:
df_ordered=df[order(df$patient_ID,df$encounter_ID),]
df_ordered[match(unique(df_ordered$patient_ID),df_ordered$patient_ID),]
Now you can choose what you like most: dplyr solutions are very readable, and the pipeline style also carries over to other languages and tools. The base R solutions are very fast but usually harder to read, and to the best of my knowledge they don't translate as directly. I posted this base R version because it is still relatively readable - every function does what its name says - though dplyr looks nicer.
Upvotes: 3
Reputation: 1580
Here is a base R solution; it is possible to do this efficiently without dplyr. duplicated codes the first row it encounters with a given patient ID as FALSE and all subsequent rows with that same patient ID as TRUE (reversed here with !), so once the data frame is ordered by encounter_ID you can use it to select only each patient's first encounter.
df <- df[order(df$encounter_ID),] #order dataframe by encounter id
#subset to rows that are not duplicates of a previous encounter for that patient
first <- df[!duplicated(df$patient_ID),]
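Running this on the example data confirms it reproduces the desired result (my restatement, using the data frame from the question):

```r
df <- data.frame(encounter_ID = c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077),
                 patient_ID   = c(855, 721, 821, 855, 423, 423, 855, 721, 423, 855),
                 gender       = c(0, 0, 1, 0, 1, 1, 0, 0, 1, 0))

df <- df[order(df$encounter_ID), ]         # oldest (smallest ID) encounters first
first <- df[!duplicated(df$patient_ID), ]  # keep each patient's first-seen row
first$encounter_ID  # 1002 1003 1013 1022
```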
Upvotes: 4
Reputation: 273
Another option
x = dplyr::arrange(df, encounter_ID)
x[!duplicated(x$patient_ID),]
# encounter_ID patient_ID gender
#1 1002 423 1
#2 1003 855 0
#4 1013 821 1
#6 1022 721 0
Upvotes: 1
Reputation: 1376
You can try:
df2 <- df %>%
group_by(patient_ID, gender) %>%
summarise(encounter_ID = min(encounter_ID))
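One caveat (my observation): grouping by both patient_ID and gender only collapses to one row per patient because gender is constant within each patient here. If it ever varied, say through a coding error, summarise would silently return extra rows, as this sketch with made-up data shows (.groups = "drop" assumes dplyr >= 1.0):

```r
library(dplyr)

# Hypothetical data where patient 1 is inconsistently coded
bad <- data.frame(encounter_ID = c(10, 11, 12),
                  patient_ID   = c(1, 1, 2),
                  gender       = c(0, 1, 1))

res <- bad %>%
  group_by(patient_ID, gender) %>%
  summarise(encounter_ID = min(encounter_ID), .groups = "drop")

nrow(res)  # 3, not 2: patient 1 appears twice
```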
Upvotes: 3
Reputation: 4370
In the dplyr code below, I sort by the two IDs and then group by patient. Using row_number() == 1 in the filter statement grabs the smallest encounter_ID per patient, because the data are sorted by both variables and grouped by patient_ID:
encounter_ID <- c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077)
patient_ID <- c(855,721,821,855,423,423,855,721,423,855)
gender <- c(0,0,1,0,1,1,0,0,1,0)
df <- data.frame(encounter_ID, patient_ID, gender)
library(dplyr)
df2 <- df %>%
arrange(patient_ID, encounter_ID) %>%
group_by(patient_ID) %>%
filter(row_number()==1)
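A small variant (not from the original answer): in dplyr 1.0+ the sort-then-filter pair can be collapsed with slice_min, which picks the rows with the smallest encounter_ID per group directly:

```r
library(dplyr)

df <- data.frame(encounter_ID = c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077),
                 patient_ID   = c(855, 721, 821, 855, 423, 423, 855, 721, 423, 855),
                 gender       = c(0, 0, 1, 0, 1, 1, 0, 0, 1, 0))

df2 <- df %>%
  group_by(patient_ID) %>%
  slice_min(encounter_ID, n = 1) %>%
  ungroup()
```

Note that with_ties = TRUE is slice_min's default, so duplicate minimum IDs would all be kept; the sequential encounter ID should make ties impossible here.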
Upvotes: 1