Andrew
Andrew

Reputation: 688

Speeding up operation on matrix columns in R

My dataset looks like the following R dataset

dat <- data.frame(z = seq(0.5, 1,0.1), matrix(1:24, nrow = 6) )
colnames(dat) <- c("z", "A", "B", "C", "D")
dat
#   z  A  B  C  D
#  0.5 1  7 13 19
#  0.6 2  8 14 20
#  0.7 3  9 15 21
#  0.8 4 10 16 22
#  0.9 5 11 17 23
#  1.0 6 12 18 24

I would like to perform the same operation for each entry in columns A, B, C and D, such that I need to add another column to dat where for each one of these column I sum the entry in each row of the remaining three columns, divide this sum by the standard deviation of the row entries, and multiply this ratio by the corresponding row value in column z. For example, take the first entry in column A. The operation is 0.5 * (7 + 13 + 19) / sd(c(7, 13, 19)). For the second entry in column B, it would be 0.6 * (2 + 14 + 20) / sd(c(2, 14, 20)). These operations yield a 6 x 4 matrix, which I need to attach to dat.

My dataset is huge (and I would like to have the function in a way that I can bootstrap it quickly), so I am wondering which one is the fastest way to this. The for loop is quite slow (and it would make bootstrapping a nightmare). I was thinking about the dplyr package, but I'm not very familiar. Thank you.

Upvotes: 4

Views: 114

Answers (5)

Onyambu
Onyambu

Reputation: 79208

one for loop is enough for this:

m=function(x,y){
   l=unlist(dat[y,names(dat)!=x])
   unname(l[1]*sum(l[-1])/sd(l[-1]))
 }
 matrix(mapply(m,names(dat)[-1],t(row(dat[-1]))),nrow(dat),byrow = T)
     [,1]     [,2]     [,3] [,4]
[1,] 3.25 1.800298 1.472971 1.75
[2,] 4.20 2.356753 1.963961 2.40
[3,] 5.25 2.978674 2.520417 3.15
[4,] 6.40 3.666061 3.142338 4.00
[5,] 7.65 4.418912 3.829724 4.95
[6,] 9.00 5.237229 4.582576 6.00

Using tidyverse:

dat%>%
   mutate(i=1:nrow(dat))%>%
   group_by(i)%>%
   gather(key,val,-i)%>%
   summarise(s=list(map_dbl(2:ncol(dat),
       ~val[1]*sum(val[-c(1,.x)])/sd(val[-c(1,.x)]))))%>%
   pull(s)%>%invoke(rbind,.)
     [,1]     [,2]     [,3] [,4]
[1,] 3.25 1.800298 1.472971 1.75
[2,] 4.20 2.356753 1.963961 2.40
[3,] 5.25 2.978674 2.520417 3.15
[4,] 6.40 3.666061 3.142338 4.00
[5,] 7.65 4.418912 3.829724 4.95
[6,] 9.00 5.237229 4.582576 6.00

You can also do:

sapply(1:4,function(x)dat[,1]*colSums(s<-t(dat[-c(1,x+1)]))/sqrt(diag(var(s))))
     [,1]     [,2]     [,3] [,4]
[1,] 3.25 1.800298 1.472971 1.75
[2,] 4.20 2.356753 1.963961 2.40
[3,] 5.25 2.978674 2.520417 3.15
[4,] 6.40 3.666061 3.142338 4.00
[5,] 7.65 4.418912 3.829724 4.95
[6,] 9.00 5.237229 4.582576 6.00

Upvotes: 1

MKR
MKR

Reputation: 20085

A solution using mutate_at can be achieved bu accessing current column name in .funs and then excluding it. The basic trick is to group_by on row_number so that calculations i.e. rowSums and sd happen for each row.

library(dplyr)

dat %>% group_by(grp = row_number()) %>%
    mutate_at(vars(A:D), 
        funs(New = z*rowSums(dat[grp,!names(dat) %in% c("z",quo_name(quo(.)))])/
              sd(dat[grp,!names(dat) %in% c("z",quo_name(quo(.)))]))) %>%
  ungroup() %>%
  select(-grp) %>% as.data.frame()

#     z A  B  C  D A_New    B_New    C_New D_New
# 1 0.5 1  7 13 19  3.25 1.800298 1.472971  1.75
# 2 0.6 2  8 14 20  4.20 2.356753 1.963961  2.40
# 3 0.7 3  9 15 21  5.25 2.978674 2.520417  3.15
# 4 0.8 4 10 16 22  6.40 3.666061 3.142338  4.00
# 5 0.9 5 11 17 23  7.65 4.418912 3.829724  4.95
# 6 1.0 6 12 18 24  9.00 5.237229 4.582576  6.00

Note: A little optimization is possible in above approach by .funs argument having a custom function with search for column names are done only once.

Upvotes: 1

thelatemail
thelatemail

Reputation: 93813

Behold! Some convoluted data.table code:

library(data.table)
setDT(dat)
dat[, row := .I]
mdat <- melt(dat, id.vars=c("row","z"))
dcast(mdat[,
     mdat[.BY[1], on="row"][!.BY[2], on="variable", sum(value)/sd(value)*z[1], by=row],
     by=.(row,variable)
     ][,-1], row ~ variable, value.var="V1")

#   row    A        B        C    D
#1:   1 3.25 1.800298 1.472971 1.75
#2:   2 4.20 2.356753 1.963961 2.40
#3:   3 5.25 2.978674 2.520417 3.15
#4:   4 6.40 3.666061 3.142338 4.00
#5:   5 7.65 4.418912 3.829724 4.95
#6:   6 9.00 5.237229 4.582576 6.00

Upvotes: 4

Ronak Shah
Ronak Shah

Reputation: 388862

I am not sure if you can avoid double loop structure especially when you have to do this operation for each element, however one way we could do this.

dat[paste0("operation", letters[1:4])] <-  t(apply(dat, 1, function(x) 
 sapply(x[-1], function(y) x[1] * sum(setdiff(x[-1], y))/sd(setdiff(x[-1], y)))))


dat
#    z A  B  C  D operationa operationb operationc operationd
#1 0.5 1  7 13 19       3.25   1.800298   1.472971       1.75
#2 0.6 2  8 14 20       4.20   2.356753   1.963961       2.40
#3 0.7 3  9 15 21       5.25   2.978674   2.520417       3.15
#4 0.8 4 10 16 22       6.40   3.666061   3.142338       4.00
#5 0.9 5 11 17 23       7.65   4.418912   3.829724       4.95
#6 1.0 6 12 18 24       9.00   5.237229   4.582576       6.00

Here, we first loop through each row and then for each element in that row, we exclude one element at a time and calculate sum and sd of remaining elements and then multiply it with the first element in that row. We attach this new matrix as new columns in the original data frame.

Upvotes: 4

Lennyy
Lennyy

Reputation: 6132

dat2 <- cbind(dat, matrix(c(
    dat$z * rowSums(dat[,c("B", "C", "D")]) / apply(dat[,c("B", "C", "D")], 1, function(x) {sd(x)}),
    dat$z * rowSums(dat[,c("A", "C", "D")]) / apply(dat[,c("A", "C", "D")], 1, function(x) {sd(x)}),
    dat$z * rowSums(dat[,c("A", "B", "D")]) / apply(dat[,c("A", "B", "D")], 1, function(x) {sd(x)}),
    dat$z * rowSums(dat[,c("A", "B", "C")]) / apply(dat[,c("A", "B", "C")], 1, function(x) {sd(x)})
    ), ncol = 4, dimnames = list(c(1:6), paste0(LETTERS[1:4], "_operation"))))

    z A  B  C  D A_operation B_operation C_operation D_operation
1 0.5 1  7 13 19        3.25    1.800298    1.472971        1.75
2 0.6 2  8 14 20        4.20    2.356753    1.963961        2.40
3 0.7 3  9 15 21        5.25    2.978674    2.520417        3.15
4 0.8 4 10 16 22        6.40    3.666061    3.142338        4.00
5 0.9 5 11 17 23        7.65    4.418912    3.829724        4.95
6 1.0 6 12 18 24        9.00    5.237229    4.582576        6.00

0.5 * (7 + 13 + 19) / sd(c(7, 13, 19)) == dat2[1,"A_operation"]
[1] TRUE
0.6 * (2 + 14 + 20) / sd(c(2, 14, 20)) == dat2[2,"B_operation"]
[1] TRUE

Upvotes: 1

Related Questions