E___F

Reputation: 1

Using lapply to sum a subset of a dataframe

I'm quite new to R and using lapply. I have a large dataframe and I'm attempting to use lapply to output the sum of some subsets of this dataframe.

group_a group_b n_variants_a n_variants_b
      1      NA            1            2
     NA       2            5            4
      1       2            2            0

I want to look at subsets based on multiple different groups (group_a, group_b) and sum each column of n_variants.

Running this over just one group and n_variant set works:

sum(subset(df, !is.na(group_a))$n_variants_a)

However, I want to sum every n_variants column for every grouping. My lapply script for this outputs 0 for each sum.

summed_variants <- lapply(list_of_groups, function(g) {
  lapply(list_of_variants, function(v) {
    sum(subset(df, !(is.na(g)))$v)
  })
})

I was wondering if I need to use paste0 to paste the list of variants in, but I couldn't get this to work.

Thanks for your help!

Upvotes: 0

Views: 514

Answers (1)

akrun

Reputation: 887118

We may use Map/mapply for this: loop over the group names and their corresponding 'n_variants' columns (assuming they are in the same order), extract both columns by name, keep the 'n_variants' values where the group column is not NA (!is.na), and take the sum.

mapply(function(x, y) sum(df1[[y]][!is.na(df1[[x]])]), 
     names(df1)[1:2], names(df1)[3:4])
group_a group_b 
      3       4 

Another option uses the tidyverse: loop across the 'n_variants' columns, take the current column name (cur_column()), replace the substring 'n_variants' with 'group', get that column's values, and use them as the condition to subset the column before taking the sum.

library(stringr)
library(dplyr)
df1 %>% 
  summarise(across(contains('variants'),
    ~ sum(.x[!is.na(get(str_replace(cur_column(), 'n_variants', 'group')))])))

-output

  n_variants_a n_variants_b
1            3            4

data

df1 <- structure(list(group_a = c(1L, NA, 1L), group_b = c(NA, 2L, 2L
), n_variants_a = c(1L, 5L, 2L), n_variants_b = c(2L, 4L, 0L)), 
class = "data.frame", row.names = c(NA, 
-3L))

Upvotes: 1
