Chris

Reputation: 2256

Parallel for-loop in Windows

I run many SQL queries against an industrial-strength database far away, but it takes a long time to receive the results. It was much faster when my computer running R was almost next to the database, which leads me to believe that the latency between my computer and the database is the bottleneck, and that running queries in parallel might speed things up. We are on different continents.

Here is a working version that is not in parallel:

doQueries <- function(filenameX, inp1, inp2) {
  print(paste("Starting:", inp1, inp2, ", saving to", filenameX))
  # The query goes here (using RODBC)
  # save(queryresults, file=filenameX)  # note: filenameX unquoted, so the argument is used
}

input.rows <- cbind(c("file1.rda","file2.rda","file3.rda"),c("A","B","C"),c(12,13,14))

for (i in 1:nrow(input.rows)) {
  doQueries(filenameX=input.rows[i,1], inp1=input.rows[i,2], inp2=input.rows[i,3])
}

I have tried the following code, but the foreach library does not seem to be available, and as I understand from reading CRAN, parallel is replacing earlier packages for parallelization ("package ‘foreach’ is not available (for R version 2.15.0)").

library(parallel)
library(foreach)
foreach (i=1:nrow(input.rows)) %dopar% {
  doQueries(filenameX=input.rows[i,1], inp1=input.rows[i,2], inp2=input.rows[i,3])
}

How should I do this instead?

Thanks to all contributors on Stackoverflow!

/Chris

Update: Thanks to nograpes, I managed to load the libraries. The following code seems to work:

library(RODBC)
library(doParallel)
library(foreach)

# odbcCloseAll()
# my_conn <- odbcConnect("database", uid="xx", pwd="yy", case="nochange")

doQueries <- function(filenameX, inp1, inp2) {
  print(paste("Starting:", inp1, inp2, ", saving to", filenameX))
  # sql.test <- RODBC::sqlQuery(my_conn, "SELECT * FROM zzz LIMIT 100", rows_at_time=1024)
  # save(sql.test, file=filenameX)  # note: filenameX unquoted, so the argument is used
}

input.rows <- cbind(c("file1.rda","file2.rda","file3.rda"),c("A","B","C"),c(12,13,14))

cl <- makeCluster(3)
registerDoParallel(cl)

foreach (i=1:nrow(input.rows)) %dopar% {
  doQueries(filenameX=input.rows[i,1], inp1=input.rows[i,2], inp2=input.rows[i,3])
}

stopCluster(cl)

But when I include the actual SQL query, this error message appears: Error in { : task 1 failed - "first argument is not an open RODBC channel"

Could it be that this conceptually will not work? That RODBC cannot handle more than one query at a time?
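For what it's worth, the error is not about RODBC handling only one query at a time: each parallel worker is a separate R process with its own workspace, so a channel opened on the master simply does not exist on the workers. A minimal sketch, needing no database at all (the variable name is made up for illustration):

```r
# Each worker in a cluster is a separate R process, so objects created
# on the master (such as an RODBC channel) are not automatically there.
library(parallel)

cl <- makeCluster(2)
master_only <- "I live on the master"  # stands in for an RODBC channel

res <- tryCatch(
  clusterEvalQ(cl, master_only),  # workers look up master_only locally
  error = function(e) "not found on workers"
)
print(res)  # → "not found on workers"

stopCluster(cl)
```

The same lookup failure is why a channel (or any object) has to be created, or explicitly exported, on each worker before the loop body can use it.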

I really appreciate all support.

/Chris

Update 2: Thanks a lot, nograpes, for the very good and impressive answers. It is difficult to judge whether the data transfers themselves are faster (total throughput seems about 20% higher), but since the queries (about 100) have different response times and need postprocessing (which I do in the function before saving), I get better combined utilization of the link and the local CPUs. With one query at a time, the CPUs sit almost idle during the data transfer, and then the link is quiet while the CPUs are working. With parallel queries, I see data arriving and the CPUs working at the same time. In total it became much faster. Thanks a lot!

/Chris

Upvotes: 6

Views: 3505

Answers (1)

nograpes

Reputation: 18323

As I mentioned in my comment, this technique will probably not be faster. To answer your question, the foreach package is available for your version of R. Perhaps your selected repository hasn't been updated yet. Try this:

install.packages('foreach', repos='http://cran.us.r-project.org')

which should install the package. If that doesn't work, grab the binary for your OS from CRAN and install it through the menus.

If the bottleneck is the network connection, then you can only speed up the process by reducing the amount of data you send over the network. One idea would be to connect remotely to the database server, have it dump the query result to a file (on the server), compress it, download it to your computer, and then have R uncompress and load it. It sounds like a lot, but you can probably do the entire process within R.
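A rough local sketch of the compress-then-load half of that idea (the data frame and paths are made up, and the actual remote dump and download steps are omitted):

```r
# Sketch: write a query result compressed, transfer the small file,
# then uncompress and load it in R. Everything here is illustrative.
df <- data.frame(id = 1:1000, val = rnorm(1000))  # stand-in for a query result

gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "w")
write.csv(df, con, row.names = FALSE)
close(con)
# ... transfer gz_path across the network here (scp, file share, etc.) ...

df2 <- read.csv(gzfile(gz_path))  # gzfile() uncompresses transparently
stopifnot(nrow(df2) == 1000)
```

Text dumps of query results often compress very well, so the bytes on the wire can shrink by an order of magnitude.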

Following up on your update, it appears that you did not include a .packages argument in your foreach statement. This is why you had to prefix the sqlQuery function with RODBC::. It is necessary to specify which packages the loop needs because, as I understand it, foreach essentially starts a new R session for each node, and each session needs to be initialized with the packages. Similarly, you can't access my_conn because it was created outside of the loop; you need to create it inside the loop so that every node has its own copy.

library(RODBC)
library(foreach)
library(doParallel)
setwd('C:/Users/x/Desktop')

doQueries <- function(conn, filenameX) {
  sql.text <- sqlQuery(conn, 'SELECT * FROM table;')
  save(sql.text, file=filenameX)
}

cl <- makeCluster(2)
registerDoParallel(cl)

foreach (i=1:2, .packages='RODBC') %dopar% {
  # each worker opens its own channel; pass it in explicitly so the
  # function does not depend on a global that only exists on the master
  my_conn <- odbcConnect("db", uid="user", pwd="pass")
  doQueries(conn=my_conn, filenameX=paste('file_', i, sep=''))
  odbcClose(my_conn)
}

stopCluster(cl)

But, like I mentioned, this probably won't be any faster.

Upvotes: 7
