SJC

Reputation: 652

Writing to large matrices within a function - fast vs slow

[Question amended following responses]

Thanks for the responses. I was unclear in my question, for which I apologise.

I'll try to give more details of our situation. We have c. 100 matrices that we keep in an environment. Each is very large. If at all possible we want to avoid any copying of these matrices when we perform updates. We're often running up against the 2GB memory limit, so this is very important for us.

So our two requirements are 1) avoiding copies and 2) addressing the matrices indirectly by name. Speed, whilst important, is a side-issue that would be solved by avoiding the copying.

It appears to me that Tommy's solution involves creating a copy (though it did fully answer my original question as asked, so the fault is mine).

The code below is what seems most obvious to us, but it clearly creates a copy (as shown by the increase in memory.size()).

myenv <- new.env()
myenv$testmat1 <- matrix(1.0, nrow=6000, ncol=200)

testfnDirect <- function(paramEnv) {
    print(memory.size())

    for (i in 1:300) {
        temp <- paramEnv$testmat1[10,] 
        paramEnv$testmat1[10,] <- temp * 0
    }   
    print(memory.size())
}
system.time(testfnDirect(myenv))

Using with() seems to avoid this, as shown below:

myenv <- new.env()
myenv$testmat1 <- matrix(1.0, nrow=6000, ncol=200)

testfnDirect <- function(paramEnv) {
    print(gc())
    varname <- "testmat1" # unused, but see text
    with (paramEnv, {
        for (i in 1:300) {
            temp <- testmat1[10,] 
            testmat1[10,] <- temp * 0
        }
    })
    print(gc())
}
system.time(testfnDirect(myenv))

However, that code works by addressing testmat1 directly by name. Our problem is that we need to address it indirectly (we don't know in advance which matrices we'll be updating).

Is there a way of amending testfnDirect so that we use the variable varname rather than hardcoding testmat1?
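
To be concrete, something along these lines is the shape of update we're after (purely illustrative; we don't know whether [[ on the environment avoids the copy any better than $ does):

varname <- "testmat1"
# indirect addressing by name -- but presumably this still copies?
paramEnv[[varname]][10,] <- paramEnv[[varname]][10,] * 0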

Upvotes: 4

Views: 326

Answers (2)

Patrick Burns

Reputation: 897

A fairly recent change to the 'data.table' package was specifically to avoid copying when modifying values. So if your application can handle data.tables for the other operations, that could be a solution. (And it would be fast.)
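
For example (a minimal sketch, assuming your matrices could be held as data.tables; the names dtlist, varname and dt are just illustrative), set() updates cells by reference without copying the whole table:

library(data.table)

# Hold each large object as a data.table inside a list (or environment)
dtlist <- list(testmat1 = as.data.table(matrix(1.0, nrow = 6000, ncol = 200)))

varname <- "testmat1"
dt <- dtlist[[varname]]               # another binding to the same data.table, not a copy
for (j in seq_len(ncol(dt))) {
    set(dt, i = 10L, j = j, value = 0)   # modifies row 10 in place, column by column
}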

Upvotes: 2

Tommy

Reputation: 40851

Well, it would be nice if you could explain why the first solution isn't OK... It looks much neater AND runs faster.

To try to answer the questions:

  1. A "nested replacement" operation like foo[bar][baz] <- 42 is very complex, and is optimized for certain cases to avoid copying. But it is very likely that your particular use case is not optimized. That would lead to lots of copies, and loss of performance.

    A way to test that theory is to call gcinfo(TRUE) before your tests. You'll then see that the first solution triggers 2 garbage collections, and the second one triggers around 160!

  2. Here's a variant of your second solution that converts the environment to a list, does its thing and then converts back to an environment. It is as fast as your first solution.

Code:

testfnList <- function() {
    mylist <- as.list(myenv, all.names=TRUE)

    thisvar <- "testmat2"
    for (i in 1:300) {
        temp <- mylist[[thisvar]][10,]
        mylist[[thisvar]][10,] <- temp * 0
    }
    
    myenv <<- as.environment(mylist)
}
system.time(testfnList()) # 0.02 secs

...it would of course be neater if you passed myenv to the function as an argument. A small improvement (if you loop a lot, not just 300 times) would be to index by number instead of name (that doesn't work for environments, but it does for lists). Just change thisvar:

thisvar <- match("testmat2", names(mylist))
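
A rough sketch of that combined variant might look like this (testfnListArg is just an illustrative name, and the gcinfo() calls tie back to point 1 above):

testfnListArg <- function(paramEnv, varname) {
    mylist <- as.list(paramEnv, all.names=TRUE)
    thisvar <- match(varname, names(mylist))   # numeric index, as above

    for (i in 1:300) {
        temp <- mylist[[thisvar]][10,]
        mylist[[thisvar]][10,] <- temp * 0
    }

    as.environment(mylist)                     # return the rebuilt environment
}

gcinfo(TRUE)                                   # per point 1: watch the garbage collections
myenv <- testfnListArg(myenv, "testmat2")
gcinfo(FALSE)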

Upvotes: 2
