Reputation: 35
I have a dataframe with groups of columns that start with the same letter (e.g. AA, AB, AC etc.). I need to produce a summary that counts the number of rows where all columns in each of the column groups match a specific value (e.g. 1).
If this is my dataframe (df)
AA_1 AA_2 AB_1 AB_2 AB_3 AC_1 AC_2
1 1 1 0 1 1 0 1
2 1 1 0 0 0 1 0
3 1 0 1 1 1 0 0
I need a summary like this:
AA 2
AB 1
AC 0
I am able to retrieve the grouped row sums by using sapply with a list of the unique column names in the dataframe:
groups <- unique(substr(names(df), 1, 2))
sapply(groups, function(xx) rowSums(df[,grep(xx, names(df)), drop=FALSE]))
Output:
AA AB AC
[1,] 1 2 1
[2,] 2 0 1
[3,] 2 3 0
But can't quite figure out how to modify this to summarise conditionally
Upvotes: 3
Views: 101
Reputation: 21938
Here is how you could do it in base R
in one step:
do.call(rbind, lapply(split.default(df, sub("([[:alpha:]]+)_\\d+", "\\1", names(df))),
\(x) {sum(apply(x, 1, \(x) all(x == 1)))}))
[,1]
AA 2
AB 1
AC 0
Upvotes: 5
Reputation: 9878
An option with purrr
and dplyr
. After extracting the groups with a regex, loop through the groups and summarise
with sum
.
The key here is the if_all
function, which returns TRUE if_all
values in the selection (starts_with(.x)
) are TRUE (or 1):
groups<-map_chr(names(df), ~str_extract(.x, '^[A-Z]{2}')) %>%
unique()
map_dfc(groups, ~df %>% summarise(sum(if_all(starts_with(.x)))))%>%
set_names(groups)%>%
pivot_longer(everything(), names_to = 'group')
# A tibble: 3 x 2
group value
<chr> <int>
1 AA 2
2 AB 1
3 AC 0
Upvotes: 1
Reputation: 4344
and here goes one tidyverse
approach:
df <- data.table::fread("AA_1 AA_2 AB_1 AB_2 AB_3 AC_1 AC_2
1 1 0 1 1 0 1
1 1 0 0 0 1 0
1 0 1 1 1 0 0")
library(tidyverse)
df %>%
# prepare data by generating new colum from row number
dplyr::mutate(rn = dplyr::row_number()) %>%
# convert to long format, separating into two columns by simplified regex
tidyr::pivot_longer(-rn,
names_to = c("letrs", "nbrs"),
names_pattern = "(.*)_(.*)",
values_to = "vals") %>%
# build grouping by letter and row number column
dplyr::group_by(letrs, rn) %>%
# if all values in group are 1, then return one else 0
dplyr::summarise(res = ifelse(all(vals) == 1, 1, 0)) %>%
# summarise the intermediate which was per letter and row number to letters now
dplyr::summarise(res = sum(res))
# A tibble: 3 x 2
letrs res
<chr> <dbl>
1 AA 2
2 AB 1
3 AC 0
Upvotes: 1
Reputation: 27792
a data.table
approach
library(data.table)
DT <- fread("AA_1 AA_2 AB_1 AB_2 AB_3 AC_1 AC_2
1 1 0 1 1 0 1
1 1 0 0 0 1 0
1 0 1 1 1 0 0")
#split to list using colnames
L <- split.default(DT, f = gsub("(^..).*", "\\1", names(DT)))
#value to look for
test_value <- 1
#loop over list, see if all columns (rowwise) match the test_value
answer <- lapply(L, function(x) data.table(val = nrow(x[rowSums(x == test_value, na.rm = TRUE) == ncol(x)])))
#rowbind answer using the names from the answer
data.table::rbindlist(answer, id = "group")
# group val
# 1: AA 2
# 2: AB 1
# 3: AC 0
Upvotes: 1