Reputation: 77
I want to build a particular function, which sums rows of different combinations of columns in whole data frame.
Example - I have a data frame with 1st descriptive column and 20 columns (v1, v2, v3, ...) from 1 to 20, where only possible values are 1 or 0. I want a function, where I can sum values from columns based on specific occurrence. If per example function would be set on 3, then I need that first 3 columns (v1, v2 and v3) would be summed together in first newly created variable (v1s), then next 3 columns would be summed together (v2, v3, v4) in next newly created variable (v2s), etc.
Reproducible example:
set.seed(300)
db <- matrix(sample(0:1,5*20, replace=TRUE),5,20)
us <- c("A","B","C","D","E")
db <- as.data.frame(cbind(us,db))
So I want to be in control to defining how many columns do I want summed up (by in control I mean it shouldn't be hard to change this to 5 columns per example). In case of 3 columns, I want to create 18 additional vars (all continuing combos of 3 variables from 1st to last) containing row sums of 3 columns.
I tried few options, however I can't really figure out easy way to do it. I was thinking of a function, which takes as input number of columns and then rowsums up values and iterates over all combos, however I can't manage to crack this one. Please help!
Upvotes: 0
Views: 334
Reputation: 42544
If I understand the question correctly the result can be obtained by reshaping the data from wide to long format, summing over a rolling window for each group id us
and by reshaping back to wide format:
library(data.table)
# define number of columns to sum over
width <- 3L
# reshape from wide to long format
melt(setDT(db), id = "us")[
# make sure that column values can be added
, value := as.integer(value)][
# sum across a rolling window for each group
, Reduce("+", shift(value, n = 0:(width - 1L), type = "lead")), by = us][
# remove rows from incomplete window sizes
!is.na(V1)][
# reshape to wide format again
, dcast(.SD, us ~ sprintf("S%02i", rowid(us)))]
which returns
us S01 S02 S03 S04 S05 S06 S07 S08 S09 S10 S11 S12 S13 S14 S15 S16 S17 S18 1: A 2 2 3 2 2 1 1 1 1 2 1 2 2 2 2 1 2 1 2: B 2 2 1 2 2 2 1 1 2 2 2 1 2 1 1 1 1 2 3: C 2 2 2 2 2 3 3 2 2 2 3 3 2 2 1 2 2 2 4: D 2 2 2 2 1 2 2 2 2 2 2 2 1 1 1 2 3 3 5: E 3 3 2 1 0 1 1 2 2 2 1 1 2 3 2 1 1 1
For n <- 5L
we get
us S01 S02 S03 S04 S05 S06 S07 S08 S09 S10 S11 S12 S13 S14 S15 S16 1: A 4 3 4 3 2 2 2 2 2 3 3 3 3 3 3 2 2: B 3 3 3 3 2 3 3 2 3 3 3 2 2 2 2 2 3: C 3 3 4 4 4 4 4 4 4 4 4 4 3 3 3 3 4: D 3 3 3 3 3 3 3 4 3 3 3 2 2 3 3 4 5: E 4 3 2 2 1 2 3 3 2 3 3 3 3 3 3 2
As prerequisite, db
must not contain any NA
values.
Upvotes: 1
Reputation: 23889
There might be an easier way. Anyhow, here is my approach:
Code:
grp_colsum <- function(df, n) {
# generate the set of indices 123, 234, 345, ...
idx <- sapply(1:n, `+`, 0:(ncol(df)-n))
# for each set of indices
res <- apply(idx, 1, function(x) {
rowSums(df[,x])
})
colnames(res) <- paste0("v", 1:ncol(res), "s")
res
}
grp_colsum(db[,-1], 3)
Output:
v1s v2s v3s v4s v5s v6s v7s v8s v9s v10s v11s v12s v13s v14s v15s v16s v17s v18s
[1,] 2 2 3 2 2 1 1 1 1 2 1 2 2 2 2 1 2 1
[2,] 2 2 1 2 2 2 1 1 2 2 2 1 2 1 1 1 1 2
[3,] 2 2 2 2 2 3 3 2 2 2 3 3 2 2 1 2 2 2
[4,] 2 2 2 2 1 2 2 2 2 2 2 2 1 1 1 2 3 3
[5,] 3 3 2 1 0 1 1 2 2 2 1 1 2 3 2 1 1 1
Upvotes: 2