Pubs

Reputation: 35

Alternative to explicit for loop for setting matrix entries based on column indexes

How can I achieve the same as below without using a for-loop?

df1 = data.frame( val = c("a", "c", "c", "b", "e") )  

m1 = matrix(0, nrow=nrow(df1), ncol=length( c("a", "b", "c", "d", "e") ) )
colnames(m1) = c("a", "b", "c", "d", "e")

for(i in 1:nrow(df1)){
  m1[i, df1[i, 1] ] = 1  #For each entry in dataframe, mark the respective column as 1
}

Upvotes: 1

Views: 118

Answers (2)

giliev

Reputation: 3058

Several things in your code are unusual. First, df1 is not needed at all: a data.frame is meant for tabular data, not for holding a single one-dimensional vector, so val = c("a", "c", "c", "b", "e") is enough. Also, as others suggested, there are more compact (and in some cases more efficient) ways to achieve the same thing. However, if your actual problem involves much more data and you find for loops easier to write, consider moving the loop to C++, where a for loop is much faster.
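As a rough illustration of one such compact form (a sketch of my own choosing, not necessarily what the other suggestions had in mind), the plain vector can be compared against the column names with outer(). The names cols and m2 are introduced here only for the example.

```r
# Plain character vector instead of a one-column data.frame
val  <- c("a", "c", "c", "b", "e")
cols <- c("a", "b", "c", "d", "e")

# outer() compares every value with every column name and returns a
# logical matrix; multiplying by 1 turns TRUE/FALSE into 1/0.
m2 <- outer(val, cols, "==") * 1
colnames(m2) <- cols

m2
```

This builds the whole indicator matrix in one step, at the cost of doing length(val) * length(cols) comparisons, so it suits a small set of columns.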

Here is a benchmark I ran to compare the R and C++ for loops, using a function that adds the first n numbers (I ran the test for n = 100K).

Here is the code:

library(Rcpp)
library(rbenchmark)

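cppFunction(
  'double cppSum(int n) {
    // Accumulate in a double: the sum 1..100000 is 5000050000,
    // which overflows a 32-bit int.
    double s = 0;
    for (int i = 1; i <= n; i++) {
      s += i;
    }
    return s;
  }'
)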

rSum <- function(n) {
  s = 0
  # seq_len(n) is empty for n = 0, whereas 1:n would iterate over 1 and 0
  for (i in seq_len(n)) {
    s = s + i
  }
  return(s)
}

n = 100000
benchmark(rSum(n), cppSum(n))

And here is the result:

       test replications elapsed relative user.self sys.self user.child sys.child
2 cppSum(n)          100   0.008     1.00      0.00        0          0         0
1   rSum(n)          100   2.790   348.75      2.79        0          0         0

The relative column shows that the R function is 348.75 times slower than the C++ function. In computationally intensive code, moving a loop to C++ can be a great optimization. I once ran an R for loop nested inside another loop, and it seemed to take forever to finish. When I replaced the R for with a C++ for, it finished in a couple of minutes.

[Edit] This example does not solve your actual problem. The original question asked for an alternative to the slow R for loop, so I suggested a faster kind of for loop: the C++ one. The example does not use your data, because it is too small for meaningful benchmarking. Instead, I use a loop with 100K iterations so that the difference between the two loops is visible.

Upvotes: 0

A. Webb

Reputation: 26456

This

f<-function(m1,df) {
  # use the df argument rather than the global df1
  for(i in seq_len(nrow(df)))
    m1[i, df[i, 1] ] = 1
  return(m1)
}

is equivalent to

g<-function(m1,df) {
  # two-column index matrix: each row is one (row, column) pair to set
  m1[cbind(seq_len(nrow(df)),df[,1])]<-1
  return(m1)
}

The latter is faster for this particular example:

> microbenchmark(f(m1,df1),g(m1,df1))
Unit: microseconds
       expr     min      lq      mean  median      uq     max neval cld
 f(m1, df1) 167.085 174.885 194.58999 185.969 200.132 342.379   100   b
 g(m1, df1)  20.116  22.990  27.12403  24.222  27.300 158.053   100  a 

Note, however,

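  • both index by the factor's integer codes rather than by the character column names: val is a factor under the default stringsAsFactors = TRUE (R before 4.0), and because "d" never occurs in the data, "e" has code 4 and marks column "d"
  • you should code what is clearest rather than what is fastest, unless and until you identify a true bottleneck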
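If indexing by column name is what is actually wanted, one possible variant (a sketch, not a replacement for the functions benchmarked above) converts the values to column positions with match() first. The helper name mark_by_name and the variables lv, val, res_chr and res_fac are hypothetical, introduced only for this example.

```r
# Hypothetical helper: marks m[i, values[i]] by column *name*,
# whether values is a character vector or a factor.
mark_by_name <- function(m, values) {
  # as.character() makes factor and character input behave the same
  cols <- match(as.character(values), colnames(m))
  if (anyNA(cols)) stop("value without a matching column name")
  # Each row of the two-column index matrix is one (row, column) pair
  m[cbind(seq_along(cols), cols)] <- 1
  m
}

lv  <- c("a", "b", "c", "d", "e")
val <- c("a", "c", "c", "b", "e")
m1  <- matrix(0, nrow = length(val), ncol = length(lv),
              dimnames = list(NULL, lv))

res_chr <- mark_by_name(m1, val)          # character input
res_fac <- mark_by_name(m1, factor(val))  # factor whose levels lack "d"
```

Both calls should mark column "e" in the last row, because the lookup goes through the column names instead of the factor codes.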

Upvotes: 4
