Reputation: 23807
The title is not quite cool - I apologise that I was not able to summarise the question better. I am conceptually a bit lost and wondered if there is a better approach for the following:
What I have:
I have two columns, ID
and eye
. Eyes can be coded as "r", "l" or "b" (right/ left/ both eyes). It does not have to contain all values, and it can include NA.
What I want: I want to count number of distinct eyes by ID. If "b" is occurring, "r|l" for the same ID should not be counted (because "right | left eye" is part of "both eyes").
Ideally base R only: My approach uses base R only, and I would much prefer a base R solution, because this is intended for a package. (Actually, the core of this function is already part of a package, but I wonder if this can be improved).
Other solutions very welcome:
The final function is also to be applied on data frames with 10^6 rows and thousands of IDs, so should be fast computation by group. My solution seems already fairly fast, (I have not done a formal test though). I would therefore also think any dplyr::group_by
solution would not be an option (at least in my approaches).
# sample data
set.seed(42)
id <- letters[sample(11, replace = TRUE)]
foo1 <- data.frame(id, eye = sample(c("r", "l", "b"), 11, replace = TRUE))
foo2 <- data.frame(id, eye = "r")
foo3 <- foo2
foo3$eye[1:5] <- NA
foo4 <- data.frame(id, eye = "b")
count_eyes <- function(x, pat_col, eye) {
# reduce to unique combinations of patient and eye, then count occurrence of
# "eye" by patient. Results in matrix of 0/1
eye_tab <- table(unique(x[, c(pat_col, eye)]))
# cases where "b" does not exist must also work (foo2 and foo3)
if(any(grepl("b", colnames(eye_tab)))){
# whenever "b" is present, "r" and "l" will be set to 0,
# so it will not be counted in the next step
# "r" and "l" might not occur
if(any(grepl("r|l", colnames(eye_tab)))){
eye_tab[, c("r","l")][eye_tab[, "b"] == 1] <- 0
}
}
# I chose the programmatic approach because the column names might not be present
# I add all 1 for each column. Because r is set to 0 previously, I have to
# add the count for b again to get the real number of right eyes.
n_b <- unname(colSums(eye_tab[, colnames(eye_tab) == "b", drop = FALSE]))
n_right <- sum(unname(colSums(eye_tab[, colnames(eye_tab) == "r", drop = FALSE])), n_b)
n_left <- sum(unname(colSums(eye_tab[, colnames(eye_tab) == "l", drop = FALSE])), n_b)
c(r = n_right, l = n_left)
}
expected result
lapply(mget(c("foo1", "foo2", "foo3", "foo4")), count_eyes, pat_col = "id", eye = "eye")
#> $foo1
#> r l
#> 7 6
#>
#> $foo2
#> r l
#> 8 0
#>
#> $foo3
#> r l
#> 6 0
#>
#> $foo4
#> r l
#> 8 8
Upvotes: 1
Views: 40
Reputation: 24888
Here's another approach with split
and rowSums
:
count_eyes <- function(x , pat_col, eye){
rowSums(sapply(split(subset(x,select = eye),
subset(x,select = pat_col)),
function(y){c(r = any(y %in% c("b", "r")),
l = any(y %in% c("b", "l")))
}))}
lapply(mget(ls(pattern="foo")),count_eyes, "id", "eye")
$foo1
r l
5 4
$foo2
r l
6 0
$foo3
r l
4 0
$foo4
r l
6 6
Upvotes: 2
Reputation: 887971
The code could be shortened if we convert the column to factor
with levels
specified
count_eyes <- function(x, pat_col, eye) {
nm1 <- c('r', 'l')
x$eye <- factor(x$eye, levels = c("b", nm1)) # // convert to factor
# reduce to unique combinations of patient and eye, then count occurrence of
# "eye" by patient. Results in matrix of 0/1
eye_tab <- table(unique(x[, c(pat_col, eye)]))
# cases where "b" does not exist must also work (foo2 and foo3)
if(any(grepl("b", colnames(eye_tab)))){
# whenever "b" is present, "r" and "l" will be set to 0,
# so it will not be counted in the next step
# "r" and "l" might not occur
if(any(grepl(paste(nm1, collapse="|"), colnames(eye_tab)))){
eye_tab[, nm1][eye_tab[, "b"] == 1] <- 0
}
}
out <- colSums(eye_tab)
out[nm1] + out['b']
}
-testing
lapply(mget(paste0('foo', 1:4)), count_eyes, pat_col = "id", eye = "eye")
#$foo1
#r l
#7 6
#$foo2
#r l
#8 0
#$foo3
#r l
#6 0
#$foo4
#r l
#8 8
Upvotes: 3