Reputation: 2684
I'm looking for a tidyverse-based approach, or at least a tidy solution, for applying custom functions over the levels of a factor in a data frame.
Consider the following test dataset:
df <- tibble(LINE=rep(c(1,2),each=6), FOUND=c(1,1,1,0,1,1,0,0,1,0,0,1))
# LINE FOUND
# <dbl> <dbl>
# 1 1 1
# 2 1 1
# 3 1 1
# 4 1 0
# 5 1 1
# 6 1 1
# 7 2 0
# 8 2 0
# 9 2 1
#10 2 0
#11 2 0
#12 2 1
I want to know, for example, the proportion of found results (i.e. FOUND==1) for each level of the LINE factor. Right now I'm working with the following code, but I'd really like to get to something cleaner.
# This is the function to calculate the proportion "found"
get_prop <- function(data) {
  tot <- data %>% nrow()
  found <- data %>% dplyr::filter(FOUND == 1) %>% nrow()
  found / tot
}
# This is the code to generate the expected result
lines <- df$LINE %>% unique %>% sort
v_line <- vector()
v_prop <- vector()
for (i in seq_along(lines)) {
  tot <- df %>% dplyr::filter(LINE == lines[i])
  v_line[i] <- lines[i]
  v_prop[i] <- get_prop(tot)
}
df_line = data.frame(LINE = v_line, CALL = v_prop)
I would expect the following to work, but it does not: it returns one row per level, but the value in each row is that of the whole dataset rather than being level-specific:
df %>% dplyr::group_by(LINE) %>% dplyr::summarise(get_prop(.))
EDIT: Please note that what I am looking for is a solution for applying a custom function over the levels of a factor in a dataframe. It is not necessarily the number or the proportion of occurrences of a particular value, as in the example illustrated.
EDIT 2: That is, I'm looking for a solution that makes use of the get_prop function above. This is not because it is the best way of solving this particular issue, but because it is more generalizable.
Upvotes: 2
Views: 1064
Reputation: 2987
Another option could be to use group_map and then tibble::enframe:
library(dplyr)
df %>%
  group_by(LINE) %>%
  group_map(~ get_prop(.)) %>%
  unlist() %>%
  tibble::enframe()
# name value
# <int> <dbl>
#1 1 0.833
#2 2 0.333
You could also use group_modify, which would keep the group names (using @JBGruber's data):
df %>%
  group_by(LINE) %>%
  group_modify(~ tibble::enframe(get_prop(.), name = NULL))
# LINE value
# <chr> <dbl>
#1 a 0.833
#2 b 0.333
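Since the question asks for something generalizable, here is a minimal sketch of group_modify with a custom function that returns several values per group. Note that get_counts is a hypothetical helper made up for this illustration (it is not part of dplyr), and the sketch assumes the numeric df from the question. The only requirement group_modify places on the function is that it returns a data frame.

```r
library(dplyr)

# The test data from the question
df <- tibble(LINE = rep(c(1, 2), each = 6),
             FOUND = c(1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1))

# Hypothetical helper: summarises one group's rows as a one-row tibble
get_counts <- function(data) {
  tibble(
    n = nrow(data),                # rows in this group
    found = sum(data$FOUND == 1),  # rows with FOUND == 1
    prop = found / n               # proportion found
  )
}

# group_modify() calls the helper once per group and re-attaches LINE
res_counts <- df %>%
  group_by(LINE) %>%
  group_modify(~ get_counts(.x))

print(res_counts)
```

The result has one row per level of LINE, with the grouping column followed by whatever columns the helper returns.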
Upvotes: 2
Reputation: 12410
If you want to apply a custom function group-wise, you can use the group_split command. This splits your data frame into a list, with each list element being the subset of the df for one group. You can then use map to apply your function to each level (note that you can group_split and map in one step by using group_map). I added the last line to get back to the form of the original approach.
library(dplyr)
library(purrr) # for map_dbl()

df %>%
  group_by(LINE) %>%
  group_split() %>%
  map_dbl(get_prop) %>%
  tibble(LINE = seq_along(.), CALL = .) # optional to get back to a df
#> # A tibble: 2 x 2
#> LINE CALL
#> <int> <dbl>
#> 1 1 0.833
#> 2 2 0.333
Created on 2020-01-20 by the reprex package (v0.3.0)
One caveat with this solution is that group_split returns an unnamed list, so the group labels are lost once the function is mapped over it (I would have preferred them to be kept as the names of the list or as an attribute). So if you want a tibble as the outcome, it makes sense to save the group labels beforehand:
groups <- group_keys(group_by(df, LINE))$LINE # keys in the order group_split() returns the groups
df %>%
  group_by(LINE) %>%
  group_split() %>%
  map_dbl(get_prop) %>%
  tibble(group = groups, result = .)
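As an aside, base R's split() does name the list elements after the levels, which is the behaviour I was hoping for from group_split. A minimal sketch, assuming the numeric df and the get_prop function from the question (both are repeated here so the block runs on its own):

```r
library(dplyr)
library(purrr)
library(tibble)

# The test data and function from the question
df <- tibble(LINE = rep(c(1, 2), each = 6),
             FOUND = c(1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1))

get_prop <- function(data) {
  tot <- data %>% nrow()
  found <- data %>% dplyr::filter(FOUND == 1) %>% nrow()
  found / tot
}

# split() names each list element after its level of LINE;
# map_dbl() keeps those names and enframe() turns them into a column
res_split <- df %>%
  split(.$LINE) %>%
  map_dbl(get_prop) %>%
  enframe(name = "LINE", value = "CALL")

print(res_split)
```

One thing to keep in mind is that the labels come back as character strings, since they are taken from the list names.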
I think the overall cleanest approach would be this (using a more general example):
library(tidyverse)
df <- tibble(LINE=rep(c("a", "b"),each=6), FOUND=c(1,1,1,0,1,1,0,0,1,0,0,1))
lvls <- group_keys(group_by(df, LINE))$LINE # group labels, in the order group_map() visits them
df %>%
  group_by(LINE) %>%
  group_map(~ get_prop(.x)) %>%
  setNames(lvls) %>%
  unlist() %>%
  enframe()
#> # A tibble: 2 x 2
#> name value
#> <chr> <dbl>
#> 1 a 0.833
#> 2 b 0.333
Created on 2020-01-20 by the reprex package (v0.3.0)
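As for why the summarise() attempt in the question returns the whole-dataset value: the . in get_prop(.) is the magrittr placeholder, so it stands for the entire data frame that was piped into summarise(), not for the current group. A minimal sketch of a summarise()-based version, assuming dplyr 1.1.0 or later (where pick() is available) and the numeric df and get_prop from the question:

```r
library(dplyr)

# The test data and function from the question
df <- tibble(LINE = rep(c(1, 2), each = 6),
             FOUND = c(1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1))

get_prop <- function(data) {
  tot <- data %>% nrow()
  found <- data %>% dplyr::filter(FOUND == 1) %>% nrow()
  found / tot
}

# pick(everything()) is the current group's rows (without the grouping
# column), so get_prop() is evaluated once per level of LINE
res_pick <- df %>%
  group_by(LINE) %>%
  summarise(CALL = get_prop(pick(everything())))

print(res_pick)
```

On older dplyr versions (1.0.x), cur_data() played the same role as pick(everything()) here.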
Upvotes: 3