user2967098

Reputation: 21

quit and restart a clean R session from within R (Windows 7, RGui 64-bit)

I am trying to quit and restart R from within R. The reason for this is that my job takes up a lot of memory, and none of the common options for cleaning R's workspace reclaim the RAM taken up by R. gc(), closeAllConnections() and rm(list = ls(all = TRUE)) clear the workspace, but when I examine the processes in the Windows Task Manager, R's RAM usage remains the same. The memory is only reclaimed when the R session is restarted.

I have tried the suggestion from this post:

Quit and restart a clean R session from within R?

but it doesn't work on my machine. It closes R, but doesn't open it again. I am running R x64 3.0.2 through RGui (64-bit) on Windows 7. Perhaps it is just a simple tweak of the first line in the above post:

makeActiveBinding("refresh", function() { shell("Rgui"); q("no") }, .GlobalEnv)

but I am unsure how it needs to be changed.
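
Something along these lines is my best guess (untested): look up the full path to Rgui.exe via R.home("bin") so it does not have to be on the PATH, and launch it without waiting so that q("no") can still run afterwards.

## untested guess: find Rgui.exe via R.home("bin") and launch it without
## waiting, so that q("no") afterwards can still execute
makeActiveBinding("refresh",
                  function() {
                    rgui <- file.path(R.home("bin"), "Rgui.exe")
                    system(paste0('"', rgui, '"'), wait = FALSE)
                    q("no")
                  },
                  .GlobalEnv)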

Here is the code. It is not fully reproducible, because it needs a large list of files that are read in and scraped. What eats memory is scrape.func(); everything else is pretty small. In the code, I apply the scrape function to all files in one folder. Eventually, I would like to apply it to a set of folders, each with a large number of files (~ 12,000 per folder; 50+ folders). Doing so is currently impossible, since R runs out of memory pretty quickly.

library(XML)
library(R.utils)

## define scraper function
scrape.func <- function(file.name){
  require(XML)

  ## read in (zipped) html file
  txt <- readLines(gunzip(file.name))

  ## parse html
  doc <- htmlTreeParse(txt,  useInternalNodes = TRUE)

  ## extract information
  top.data <- xpathSApply(doc, "//td[@valign='top']", xmlValue)
  id <- top.data[which(top.data=="I.D.:") + 1]
  pub.date <- top.data[which(top.data=="Data publicarii:") + 1]
  doc.type <- top.data[which(top.data=="Tipul documentului:") + 1]

  ## tie into dataframe
  df <- data.frame(
    id, pub.date, doc.type, stringsAsFactors = FALSE)

  ## clean up before returning (anything placed after return() never runs)
  closeAllConnections()
  rm(txt, top.data, doc)
  gc()

  return(df)
}

## where to store the scraped data
file.create("/extract.top.data.2008.1.csv")

## extract the list of files from the target folder
write(list.files(path = "/2008/01"), 
      file = "/list.files.2008.1.txt")

## count the number of files
length.list <- length(readLines("/list.files.2008.1.txt"))
length.list <- length.list - 1

## read in filename by filename and scrape
for (i in 0:length.list){
  ## read in line by line
  line <- scan("/list.files.2008.1.txt", '', 
               skip = i, nlines = 1, sep = '\n', quiet = TRUE)
  ## catch the full path 
  filename <- paste0("/2008/01/", as.character(line))
  ## scrape
  data <- scrape.func(filename)
  ## append output to results file
  write.table(data, file = "/extract.top.data.2008.1.csv", 
              append = TRUE, sep = ",", col.names = FALSE)
  ## rezip the html
  filename2 <- sub(".gz","",filename)
  gzip(filename2)
}

Many thanks in advance, Marko

Upvotes: 2

Views: 1641

Answers (1)

NMM

Reputation: 91

I also did some web scraping and ran directly into the same problem as you, and it drove me crazy. Although I'm running a modern OS (Windows 10), the memory is still not released from time to time. After having a look at the R FAQ I went for CleanMem, where you can set an automated memory cleaner to run every 5 minutes or so. Be sure to use

rm(list = ls())
gc()
closeAllConnections()

beforehand, so that R releases the memory. Then use CleanMem so that the OS notices that there is free memory.
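
For what it's worth, here is a minimal sketch of how I wrap those three calls so I can run them between batches (flush_memory is just my own helper name, nothing provided by CleanMem):

## hypothetical helper: flush as much R-side memory as possible so the OS
## (or CleanMem) can then reclaim it
flush_memory <- function() {
  closeAllConnections()                                  # drop open connections
  rm(list = ls(envir = .GlobalEnv), envir = .GlobalEnv)  # clear the workspace
  invisible(gc())                                        # ask R to return freed pages
}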

Upvotes: 1
