nerdyaswild
nerdyaswild

Reputation: 122

Data.table: Applying function to rows of grouped columns (could also call it collapsing columns)

I'm looking for a more elegant way to apply a function (i.e. sum) to each row in groups of columns. I've got it to work by transposing and collapsing those columns instead, but that requires a lot of calculation for large datasets.

Here's my sample data:

data <- data.table(C1=rep(1,5),C2=rep(2,5),C3=rep(1,5),C4=rep(2,5))
data
#>   C1 C2 C3 C4
#> 1:  1  2  1  2
#> 2:  1  2  1  2
#> 3:  1  2  1  2
#> 4:  1  2  1  2
#> 5:  1  2  1  2
group <- data.table(Sample=c("C1","C2","C3","C4"),Group = c("X","Y","X","Y"))
group
#>   Sample Group
#> 1:     C1     X
#> 2:     C2     Y
#> 3:     C3     X
#> 4:     C4     Y

All I want to do is sum "C1" and "C3" (Group X) together, "C2" and "C4" (Group Y) together, and have the group name as the column name. This is what I want to end up with:

   X Y
1: 2 4
2: 2 4
3: 2 4
4: 2 4
5: 2 4

Here's my solution:

data <- data.table(C1=rep(1,5),C2=rep(2,5),C3=rep(1,5),C4=rep(2,5))
group <- data.table(Sample=c("C1","C2","C3","C4"),Group = c("X","Y","X","Y"))

data <- transpose(data)
data <- data[,lapply(.SD,sum),by=list(group$Group)]
data <- transpose(data,make.names = "group")
data
#>   X Y
#> 1: 2 4
#> 2: 2 4
#> 3: 2 4
#> 4: 2 4
#> 5: 2 4

It works, but I'm sure there's a better way of doing it. Transposing twice is pretty expensive for large matrices.

Upvotes: 1

Views: 78

Answers (3)

akrun
akrun

Reputation: 887291

If they are in the same order, use split.default

setDT(lapply(split.default(data, group$Group), rowSums))[]

-output

   X Y
1: 2 4
2: 2 4
3: 2 4
4: 2 4
5: 2 4

If the column names are not in the same order, then use a matching with named vector

nm1 <- setNames(group$Group, group$Sample)[colnames(data)]
setDT(lapply(split.default(data, nm1), rowSums))[]

Or may also do the split from the 'group' data and loop over the list, extract the columns, and do rowSums

setDT(lapply(split(group$Sample, group$Group),
       function(x) rowSums(data[, ..x])))[]

Benchmarks

set.seed(24)
data_test <- as.data.table(matrix(rnorm(5000 * 5000), ncol = 5000, dimnames = list(NULL, paste0("C", 1:5000))))

group_test <- data.table(Sample= paste0("C", 1:5000),Group = rep(LETTERS[1:10], 500) )

system.time({
nm1 <- setNames(group_test$Group, group_test$Sample)[colnames(data_test)]
setDT(lapply(split.default(data_test, nm1), rowSums))[]

})
#   user  system elapsed 
#  0.167   0.048   0.219 


system.time({
long <- melt(data_test[, rn := .I], "rn")
dcast(long[group_test, on = "variable==Sample"], rn ~ Group, sum)


})
#   user  system elapsed 
#  2.897   0.305   3.189 

Upvotes: 2

Anoushiravan R
Anoushiravan R

Reputation: 21928

Here is another solution you could use:

library(dplyr)
library(tidyr)
library(rlang)

group %>%
  group_by(Group) %>%
  summarise(Sum = eval_tidy(parse_expr(paste0(Sample, collapse = "+")), data = data)) %>%
  mutate(id = row_number()) %>%
  pivot_wider(names_from = Group, values_from = Sum)

# A tibble: 5 x 3
     id     X     Y
  <int> <dbl> <dbl>
1     1     2     4
2     2     2     4
3     3     2     4
4     4     2     4
5     5     2     4

Upvotes: 0

Uwe
Uwe

Reputation: 42564

Perhaps, it might be worthwhile to consider a completely different storage format for data.

Reshaping data to long format will allow for treating the column names as data items and to join with group for instance.

long <- melt(data[, rn := .I], "rn")
dcast(long[group, on = "variable==Sample"], rn ~ Group, sum)
   rn X Y
1:  1 2 4
2:  2 2 4
3:  3 2 4
4:  4 2 4
5:  5 2 4

Upvotes: 2

Related Questions