Reputation: 122
I'm looking for a more elegant way to apply a function (i.e. sum) to each row in groups of columns. I've got it to work by transposing and collapsing those columns instead, but that requires a lot of calculation for large datasets.
Here's my sample data:
data <- data.table(C1=rep(1,5),C2=rep(2,5),C3=rep(1,5),C4=rep(2,5))
data
#> C1 C2 C3 C4
#> 1: 1 2 1 2
#> 2: 1 2 1 2
#> 3: 1 2 1 2
#> 4: 1 2 1 2
#> 5: 1 2 1 2
group <- data.table(Sample=c("C1","C2","C3","C4"),Group = c("X","Y","X","Y"))
group
#> Sample Group
#> 1: C1 X
#> 2: C2 Y
#> 3: C3 X
#> 4: C4 Y
All I want to do is sum "C1"
and "C3"
(Group X
) together, "C2"
and "C4"
(Group Y
) together, and have the group name as the column name. This is what I want to end up with:
X Y
1: 2 4
2: 2 4
3: 2 4
4: 2 4
5: 2 4
Here's my solution:
data <- data.table(C1=rep(1,5),C2=rep(2,5),C3=rep(1,5),C4=rep(2,5))
group <- data.table(Sample=c("C1","C2","C3","C4"),Group = c("X","Y","X","Y"))
data <- transpose(data)
data <- data[,lapply(.SD,sum),by=list(group$Group)]
data <- transpose(data,make.names = "group")
data
#> X Y
#> 1: 2 4
#> 2: 2 4
#> 3: 2 4
#> 4: 2 4
#> 5: 2 4
It works, but I'm sure there's a better way of doing it. Transposing twice is pretty expensive for large matrices.
Upvotes: 1
Views: 78
Reputation: 887291
If they are in the same order, use split.default
setDT(lapply(split.default(data, group$Group), rowSums))[]
-output
X Y
1: 2 4
2: 2 4
3: 2 4
4: 2 4
5: 2 4
If the column names are not in the same order, then use a matching with named vector
nm1 <- setNames(group$Group, group$Sample)[colnames(data)]
setDT(lapply(split.default(data, nm1), rowSums))[]
Or may also do the split
from the 'group' data and loop over the list
, extract the columns, and do rowSums
setDT(lapply(split(group$Sample, group$Group),
function(x) rowSums(data[, ..x])))[]
set.seed(24)
data_test <- as.data.table(matrix(rnorm(5000 * 5000), ncol = 5000, dimnames = list(NULL, paste0("C", 1:5000))))
group_test <- data.table(Sample= paste0("C", 1:5000),Group = rep(LETTERS[1:10], 500) )
system.time({
nm1 <- setNames(group_test$Group, group_test$Sample)[colnames(data_test)]
setDT(lapply(split.default(data_test, nm1), rowSums))[]
})
# user system elapsed
# 0.167 0.048 0.219
system.time({
long <- melt(data_test[, rn := .I], "rn")
dcast(long[group_test, on = "variable==Sample"], rn ~ Group, sum)
})
# user system elapsed
# 2.897 0.305 3.189
Upvotes: 2
Reputation: 21928
Here is another solution you could use:
library(dplyr)
library(tidyr)
library(rlang)
group %>%
group_by(Group) %>%
summarise(Sum = eval_tidy(parse_expr(paste0(Sample, collapse = "+")), data = data)) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = Group, values_from = Sum)
# A tibble: 5 x 3
id X Y
<int> <dbl> <dbl>
1 1 2 4
2 2 2 4
3 3 2 4
4 4 2 4
5 5 2 4
Upvotes: 0
Reputation: 42564
Perhaps, it might be worthwhile to consider a completely different storage format for data
.
Reshaping data
to long format will allow for treating the column names as data items and to join with group
for instance.
long <- melt(data[, rn := .I], "rn")
dcast(long[group, on = "variable==Sample"], rn ~ Group, sum)
rn X Y 1: 1 2 4 2: 2 2 4 3: 3 2 4 4: 4 2 4 5: 5 2 4
Upvotes: 2