Apply a custom function over levels of a factor in a dataframe

Question

I'm looking for a tidyverse-based approach, or at least a tidy solution, for applying custom functions over the levels of a factor in a dataframe.

Consider the following test dataset:

library(tidyverse)
df <- tibble(LINE=rep(c(1,2),each=6), FOUND=c(1,1,1,0,1,1,0,0,1,0,0,1))

#    LINE FOUND
#   <dbl> <dbl>
# 1     1     1
# 2     1     1
# 3     1     1
# 4     1     0
# 5     1     1
# 6     1     1
# 7     2     0
# 8     2     0
# 9     2     1
#10     2     0
#11     2     0
#12     2     1

I want to know, for example, the proportion of found results (i.e., FOUND==1) for each level of the LINE factor. Right now I'm working with the following code, but I'd really like to get to something cleaner.

# This is the function to calculate the proportion "found"
get_prop <- function (data) {
  tot <- data %>% nrow()
  found <- data %>% dplyr::filter(FOUND==1) %>% nrow
  found / tot
}

# This is the code to generate the expected result
lines <- df$LINE %>% unique %>% sort
v_line <- vector()
v_prop <- vector()
for (i in 1:length(lines)) {
  tot <- df %>% dplyr::filter(LINE==lines[i])
  v_line[i] <- lines[i]
  v_prop[i] <- get_prop(tot)
}
df_line = data.frame(LINE = v_line, CALL = v_prop)

I would expect the following to work, but it does not: it returns one row per level, but the value in each row is computed on the whole dataset rather than on that level:

df %>% dplyr::group_by(LINE) %>% dplyr::summarise(get_prop(.))

EDIT: Please note that what I am looking for is a solution for applying a custom function over the levels of a factor in a dataframe. It is not necessarily the number or the proportion of occurrences of a particular value, as in the example illustrated.

EDIT 2: That is, I'm looking for a solution that makes use of the get_prop function above. This is not because it is the best way of solving this particular issue, but because it is more generalizable.
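To make the problem concrete, here is a small sketch of why the summarise() attempt repeats the same number for every level. It assumes dplyr and tibble are installed and reuses df and get_prop from above; the names per_group_attempt and whole are only illustrative.

```r
library(dplyr)
library(tibble)

df <- tibble(LINE = rep(c(1, 2), each = 6),
             FOUND = c(1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1))

get_prop <- function(data) {
  tot <- data %>% nrow()
  found <- data %>% dplyr::filter(FOUND == 1) %>% nrow()
  found / tot
}

# Inside a magrittr pipeline, `.` is the object piped into summarise(),
# i.e. the complete (grouped) data frame, not the rows of the current group.
per_group_attempt <- df %>%
  group_by(LINE) %>%
  summarise(CALL = get_prop(.))

# Proportion over all 12 rows, for comparison.
whole <- get_prop(df)
```

Both rows of per_group_attempt carry the same value as whole, which is the behaviour described above.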

JBGruber · Accepted Answer

If you want to apply a custom function group-wise, you can use group_split. It splits your data frame into a list in which each element is the subset of df belonging to one group. You can then use map to apply your function to each element (note that you can do the group_split and map in one step by using group_map). I added the last line to get back to the form of the original approach.

df %>% 
  group_by(LINE) %>% 
  group_split() %>% 
  map_dbl(get_prop) %>% 
  tibble(LINE = seq_along(.), CALL = .) # optional to get back to a df
#> # A tibble: 2 x 2
#>    LINE  CALL
#>   <int> <dbl>
#> 1     1 0.833
#> 2     2 0.333

Created on 2020-01-20 by the reprex package (v0.3.0)

One thing that worries me about this solution is that group_split returns an unnamed list, so the group labels are not carried along with the results (I would have preferred them to be kept as the names of the list or as an attribute). So if you want a tibble as the outcome, it might make sense to save the grouping variable beforehand:

groups <- sort(unique(df$LINE)) # sorted so the labels line up with the ascending group order used by group_by()

df %>% 
  group_by(LINE) %>% 
  group_split() %>% 
  map_dbl(get_prop) %>% 
  tibble(group = groups, result = .)
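As a possible alternative to saving the labels by hand, the group keys can be taken from the grouped data itself. The following is a minimal sketch, assuming dplyr 0.8 or later (where group_keys and group_split are available) together with purrr; the names grouped and result are only illustrative.

```r
library(dplyr)
library(purrr)
library(tibble)

df <- tibble(LINE = rep(c(1, 2), each = 6),
             FOUND = c(1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1))

get_prop <- function(data) {
  tot <- data %>% nrow()
  found <- data %>% dplyr::filter(FOUND == 1) %>% nrow()
  found / tot
}

grouped <- df %>% group_by(LINE)

result <- grouped %>%
  group_keys() %>%                                        # one row per group, in group order
  mutate(CALL = map_dbl(group_split(grouped), get_prop))  # same order as the keys
```

Because group_keys and group_split are both derived from the same grouped data frame, the labels and the results stay aligned without relying on the order of unique().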

update

I think the overall cleanest approach would be this (using a more general example):

library(tidyverse)
df <- tibble(LINE=rep(c("a", "b"),each=6), FOUND=c(1,1,1,0,1,1,0,0,1,0,0,1))

lvls <- sort(unique(df$LINE)) # sorted so the names line up with the group order

df %>% 
  group_by(LINE) %>% 
  group_map(~ get_prop(.x)) %>% 
  setNames(lvls) %>% 
  unlist() %>% 
  enframe()
#> # A tibble: 2 x 2
#>   name  value
#>   <chr> <dbl>
#> 1 a     0.833
#> 2 b     0.333

Created on 2020-01-20 by the reprex package (v0.3.0)
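If the goal is a tibble that keeps the grouping column, another option is group_modify, which expects the function to return a data frame and puts the group keys back itself. This is only a sketch, assuming dplyr 0.8.1 or later; wrapping get_prop in tibble(CALL = ...) is my own choice for illustration.

```r
library(dplyr)
library(tibble)

df <- tibble(LINE = rep(c("a", "b"), each = 6),
             FOUND = c(1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1))

get_prop <- function(data) {
  tot <- data %>% nrow()
  found <- data %>% dplyr::filter(FOUND == 1) %>% nrow()
  found / tot
}

result <- df %>%
  group_by(LINE) %>%
  # .x is the subset of rows for the current group (without the LINE column);
  # the returned one-row tibble is combined with that group's key.
  group_modify(~ tibble(CALL = get_prop(.x))) %>%
  ungroup()
```

This avoids the separate setNames, unlist and enframe steps, at the cost of requiring the custom function's result to be wrapped in a data frame.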
