petermeissner

Reputation: 12860

R does not stop grabbing memory / RAM due to XML

I have a double loop like the one shown below. The problem is that R (2.15.2) uses more and more memory and I do not understand why.

While I understand that this has to happen within the inner loop because of the rbind() I am doing there, I do not understand why R keeps grabbing memory when a new cycle of the outer loop starts, since the objects (like 'xmlCatcher') are reused:

# !!!BEWARE this example creates a lot of files (n=1000)!!!!

require(XML)

chunk <- function(x, chunksize){
        # source: http://stackoverflow.com/a/3321659/1144966
        x2 <- seq_along(x)
        split(x, ceiling(x2/chunksize))
    }

chunky <- chunk(paste("test",1:1000,".xml",sep=""),100)

for(i in 1:1000){
    writeLines(
        c(paste('<?xml version="1.0"?>\n <note>\n    <to>Tove</to>\n    <nr>', i,
                '</nr>\n    <from>Jani</from>\n    <heading>Reminder</heading>\n    ', sep=""),
          paste(rep('<body>Do not forget me this weekend!</body>\n', sample(1:10, 1)), sep=""),
          ' </note>'),
        paste("test", i, ".xml", sep=""))
}

for(k in 1:length(chunky)){
    gc()
    print(chunky[[k]])
    xmlCatcher <- NULL

    for(i in 1:length(chunky[[k]])){
        filename    <- chunky[[k]][i]
        xml         <- xmlTreeParse(filename)
        xml         <- xmlRoot(xml)
        result      <- sapply(getNodeSet(xml, "//body"), xmlValue)
        id          <- sapply(getNodeSet(xml, "//nr"), xmlValue)
        dummy       <- cbind(id, result)
        xmlCatcher  <- rbind(xmlCatcher, dummy)
    }
    save(xmlCatcher, file=paste("xmlCatcher", k, ".RData", sep=""))
}

Does somebody have an idea why this behaviour occurs? Note that all the objects (like 'xmlCatcher') are reused in every cycle, so I would assume that the RAM used should stay about the same after the first 'chunk' cycle.

Is this a bug or am I missing something?

Upvotes: 4

Views: 290

Answers (3)

petermeissner

Reputation: 12860

It's the XML package, stupid!

The answer to this question came from Milan Bouchet-Valat here, who proposed that I try the useInternalNodes=TRUE option of xmlTreeParse. That stopped the RAM grabbing, although it is also possible to free the memory manually. For further reading see: here.
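A minimal sketch of what the fix looks like inside the inner loop, assuming 'filename' is one of the generated test files; with internal nodes, getNodeSet works on the document directly and free() can release the C-level document explicitly:

require(XML)

xml    <- xmlTreeParse(filename, useInternalNodes = TRUE)  # keep the parsed tree at C level
result <- sapply(getNodeSet(xml, "//body"), xmlValue)
id     <- sapply(getNodeSet(xml, "//nr"), xmlValue)
free(xml)                                                  # release the internal document manually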

Upvotes: 2

Romain Francois

Reputation: 17642

Your understanding of reusing memory is wrong.

When you assign the new xmlCatcher, the old one is no longer referenced and becomes a candidate for garbage collection, which will happen at some point.

You are not reusing memory, you are creating a new object and abandoning the old one.

Garbage collection will free the memory.
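A small illustration of that point (the matrix sizes are just assumptions to make the effect visible):

x <- matrix(0, 5000, 5000)   # allocates roughly 190 MB
x <- matrix(0, 10, 10)       # the big matrix is now unreferenced, not reused
gc()                         # the collector reclaims the abandoned memory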

Also, I suggest you look at Rprofmem to profile your memory use.
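A minimal sketch of how Rprofmem could be used here (the output file name and threshold are assumptions; R must be compiled with --enable-memory-profiling for it to record anything):

Rprofmem("memory.out", threshold = 10000)   # log every allocation of 10 kB or more
xml <- xmlTreeParse("test1.xml")            # code whose allocations you want to inspect
Rprofmem(NULL)                              # stop profiling
noquote(readLines("memory.out", n = 5))     # look at the first logged allocations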

Upvotes: 7

agstudy

Reputation: 121568

Chapter 2 of this talks about rbind as a common way of being a memory glutton.

You can avoid the use of rbind inside the loop:

n       <- length(chunky[[k]])
my.list <- vector('list', n)              # pre-allocate one list slot per file
for(i in 1:n) {
   # ... parse chunky[[k]][i] and build 'dummy' as before ...
   my.list[[i]] <- data.frame(dummy)
}
xmlCatcher <- do.call('rbind', my.list)   # a single rbind after the loop

Upvotes: 4
