Counting unique occurrence of two values per ID, considering presence of third value

Question

The title is not quite cool - I apologise that I was not able to summarise the question better. I am conceptually a bit lost and wondered if there is a better approach for the following:

What I have: I have two columns, ID and eye. Eyes can be coded as "r", "l" or "b" (right/ left/ both eyes). It does not have to contain all values, and it can include NA.

What I want: I want to count number of distinct eyes by ID. If "b" is occurring, "r|l" for the same ID should not be counted (because "right | left eye" is part of "both eyes").

Ideally base R only: My approach uses base R only, and I would much prefer a base R solution, because this is intended for a package. (Actually, the core of this function is already part of a package, but I wonder if this can be improved).

Other solutions very welcome: The final function is also to be applied on data frames with 10^6 rows and thousands of IDs, so should be fast computation by group. My solution seems already fairly fast, (I have not done a formal test though). I would therefore also think any dplyr::group_by solution would not be an option (at least in my approaches).

# sample data
set.seed(42)
id <- letters[sample(11, replace = TRUE)]
foo1 <- data.frame(id, eye = sample(c("r", "l", "b"), 11, replace = TRUE))
foo2 <- data.frame(id, eye = "r")
foo3 <- foo2
foo3$eye[1:5] <- NA
foo4 <- data.frame(id, eye = "b")

count_eyes <- function(x, pat_col, eye) {
  # reduce to unique combinations of patient and eye, then count occurrence of 
  # "eye" by patient. Results in matrix of 0/1
  eye_tab <- table(unique(x[, c(pat_col, eye)]))
  # cases where "b" does not exist must also work (foo2 and foo3)
  if(any(grepl("b", colnames(eye_tab)))){
    
  # whenever "b" is present, "r" and "l" will be set to 0, 
  # so it will not be counted in the next step
  # "r" and "l" might not occur
    if(any(grepl("r|l", colnames(eye_tab)))){
      eye_tab[, c("r","l")][eye_tab[, "b"] == 1] <- 0
    }
  }
  # I chose the programmatic approach because the column names might not be present 
  # I add all 1 for each column. Because r is set to 0 previously, I have to 
  # add the count for b again to get the real number of right eyes. 
  n_b <- unname(colSums(eye_tab[, colnames(eye_tab) == "b", drop = FALSE]))
  n_right <- sum(unname(colSums(eye_tab[, colnames(eye_tab) == "r", drop = FALSE])), n_b)
  n_left <- sum(unname(colSums(eye_tab[, colnames(eye_tab) == "l", drop = FALSE])), n_b)
  
  c(r = n_right, l = n_left)
}

expected result

lapply(mget(c("foo1", "foo2", "foo3", "foo4")), count_eyes, pat_col = "id", eye = "eye")
#> $foo1
#> r l 
#> 7 6 
#> 
#> $foo2
#> r l 
#> 8 0 
#> 
#> $foo3
#> r l 
#> 6 0 
#> 
#> $foo4
#> r l 
#> 8 8

akrun · Accepted Answer

The code could be shortened if we convert the column to factor with levels specified

count_eyes <- function(x, pat_col, eye) {
  nm1 <- c('r', 'l')
  x$eye <- factor(x$eye, levels = c("b", nm1)) # // convert to factor
  # reduce to unique combinations of patient and eye, then count occurrence of 
  # "eye" by patient. Results in matrix of 0/1
  
  eye_tab <- table(unique(x[, c(pat_col, eye)]))
  # cases where "b" does not exist must also work (foo2 and foo3)
  if(any(grepl("b", colnames(eye_tab)))){
    
  # whenever "b" is present, "r" and "l" will be set to 0, 
  # so it will not be counted in the next step
  # "r" and "l" might not occur
    if(any(grepl(paste(nm1, collapse="|"), colnames(eye_tab)))){
      eye_tab[, nm1][eye_tab[, "b"] == 1] <- 0
    }
  }
  
  out <- colSums(eye_tab)      
  out[nm1] + out['b']
  
  
}

-testing

lapply(mget(paste0('foo', 1:4)), count_eyes, pat_col = "id", eye = "eye")
#$foo1
#r l 
#7 6 

#$foo2
#r l 
#8 0 

#$foo3
#r l 
#6 0 

#$foo4
#r l 
#8 8

Counting unique occurrence of two values per ID, considering presence of third value

Answers (2)

Related Questions