Eldur
Eldur

Reputation: 35

Summarise column groups based on whether all columns within the group match a specific value

I have a dataframe with groups of columns that start with the same letter (e.g. AA, AB, AC etc.). I need to produce a summary that counts the number of rows where all columns in each of the column groups match a specific value (e.g. 1).

If this is my dataframe (df)

  AA_1 AA_2 AB_1 AB_2 AB_3 AC_1 AC_2
1    1    1    0    1    1    0    1
2    1    1    0    0    0    1    0
3    1    0    1    1    1    0    0

I need a summary like this:

AA    2    
AB    1      
AC    0    

I am able to retrieve the grouped row sums by using sapply with a list of the unique column names in the dataframe:

 groups <- unique(substr(names(df), 1, 2))
 sapply(groups, function(xx) rowSums(df[,grep(xx, names(df)), drop=FALSE]))

Output:

     AA AB AC
[1,]  1  2  1
[2,]  2  0  1
[3,]  2  3  0

But can't quite figure out how to modify this to summarise conditionally

Upvotes: 3

Views: 101

Answers (4)

Anoushiravan R
Anoushiravan R

Reputation: 21938

Here is how you could do it in base R in one step:

do.call(rbind, lapply(split.default(df, sub("([[:alpha:]]+)_\\d+", "\\1", names(df))), 
                      \(x) {sum(apply(x, 1, \(x) all(x == 1)))}))

   [,1]
AA    2
AB    1
AC    0

Upvotes: 5

GuedesBF
GuedesBF

Reputation: 9878

An option with purrr and dplyr. After extracting the groups with a regex, loop through the groups and summarise with sum. The key here is the if_all function, which returns TRUE if_all values in the selection (starts_with(.x)) are TRUE (or 1):

groups<-map_chr(names(df), ~str_extract(.x, '^[A-Z]{2}')) %>%
        unique()

map_dfc(groups, ~df %>% summarise(sum(if_all(starts_with(.x)))))%>%
        set_names(groups)%>%
        pivot_longer(everything(), names_to = 'group')

# A tibble: 3 x 2
  group value
  <chr> <int>
1 AA        2
2 AB        1
3 AC        0

Upvotes: 1

DPH
DPH

Reputation: 4344

and here goes one tidyverse approach:

df <- data.table::fread("AA_1 AA_2 AB_1 AB_2 AB_3 AC_1 AC_2
 1    1    0    1    1    0    1
 1    1    0    0    0    1    0
 1    0    1    1    1    0    0") 

library(tidyverse)

df %>% 
    # prepare data by generating new colum from row number
    dplyr::mutate(rn = dplyr::row_number()) %>%
    # convert to long format, separating into two columns by simplified regex
    tidyr::pivot_longer(-rn, 
                        names_to = c("letrs", "nbrs"),
                        names_pattern = "(.*)_(.*)",
                        values_to = "vals") %>%
    # build grouping by letter and row number column
    dplyr::group_by(letrs, rn) %>%
    # if all values in group are 1, then return one else 0
    dplyr::summarise(res = ifelse(all(vals) == 1, 1, 0)) %>%
    # summarise the intermediate which was per letter and row number to letters now
    dplyr::summarise(res = sum(res))

# A tibble: 3 x 2
  letrs   res
  <chr> <dbl>
1 AA        2
2 AB        1
3 AC        0

Upvotes: 1

Wimpel
Wimpel

Reputation: 27792

a data.table approach

library(data.table)
DT <- fread("AA_1 AA_2 AB_1 AB_2 AB_3 AC_1 AC_2
    1    1    0    1    1    0    1
    1    1    0    0    0    1    0
    1    0    1    1    1    0    0")

#split to list using colnames
L <- split.default(DT, f = gsub("(^..).*", "\\1", names(DT)))
#value to look for
test_value <- 1
#loop over list, see if all columns (rowwise) match the test_value
answer <- lapply(L, function(x) data.table(val = nrow(x[rowSums(x == test_value, na.rm = TRUE) == ncol(x)])))
#rowbind answer using the names from the answer
data.table::rbindlist(answer, id = "group")
#    group val
# 1:    AA   2
# 2:    AB   1
# 3:    AC   0

Upvotes: 1

Related Questions