Reputation: 311
I am trying to read data files from a directory using foreach, but it is giving me an error. It works on my office machine but does not work at home. Both machines have 4 cores (check the output at the bottom). Here is the code:
rm(list=ls())
setwd("D:/Test")
library(foreach)
library(doParallel)
c1<-makeCluster(4, outfile = "debug.txt")
CE<-clusterEvalQ(c1, .libPaths(""))
registerDoParallel(c1)
print(paste0("Cores = ",detectCores()))
file.names <- dir(pattern ="h00|B00")
output<-list()
output<-foreach (i=1:4) %dopar% {
read.table(file=file.names[i])
}
stopCluster(c1)
I am getting this error:
Error in { : task 1 failed - "cannot open the connection"
Upvotes: 2
Views: 2760
Reputation: 5590
I'm not an expert in parallel operations in R, and I can't say why you get different behaviour on different machines (are they the same OS, with the same versions of R and the packages?). My understanding is that functions like foreach start up multiple R sessions in the background, each of which acts as a "node" to compute a subset of the operations. In your case, each of those nodes needs to find the files to feed to read.table, so I personally think it's good practice to pass full file paths whenever using parallel processes.
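One quick way to see whether this is the problem (a minimal sketch, assuming a local cluster like the one created in the question; the cluster size here is arbitrary) is to ask each worker for its working directory with clusterEvalQ and compare it to the master session's:

```r
library(parallel)

# Spin up a small local cluster, as makeCluster() does in the question
cl <- makeCluster(2)

# Ask every worker where it thinks it is. If any of these differ from
# getwd() in the master session, relative file names will fail on the
# workers even though they work interactively.
worker_wds <- clusterEvalQ(cl, getwd())
print(worker_wds)
print(getwd())

stopCluster(cl)
```

If the directories disagree, absolute paths (as below) sidestep the problem entirely.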
Using dir with default parameters returns relative file paths (i.e. relative to your current working directory), meaning you need to stay in that directory to refer to them correctly. I'll explain by first setting my working directory to my Desktop:
desktop.path <- "~/Desktop"
setwd( desktop.path )
getwd()
# [1] "/Users/ross/Desktop"
Now we can get the path to a ".txt" file, which sits on my Desktop, in a few ways. First, with default parameters.
file.default <- dir( pattern = "txt" )
file.default
# [1] "CW_denseCloud_LowestQual_withGCPs.txt"
Notice there's nothing in that path to show where the file lives; we're relying on our current working directory to find it, which is fine for now.
file.exists( file.default )
# [1] TRUE
But if we end up in another working directory, we'll lose the file:
setwd( "~" )
file.exists( file.default )
# [1] FALSE
If we pass the parameter full.names = TRUE, we get more than the bare file name, but it's still a relative path, which doesn't help:
setwd( desktop.path )
dir( pattern = "txt", full.names = TRUE )
# [1] "./CW_denseCloud_LowestQual_withGCPs.txt"
What will help is passing a complete path to dir as well, so that dir is effectively looking for the file relative to the root directory, rather than from the current working directory:
file.full <- dir( path = desktop.path, pattern = "txt", full.names = TRUE )
file.full
# [1] "/Users/ross/Desktop/CW_denseCloud_LowestQual_withGCPs.txt"
Now we've got a complete path to the file rather than a relative one, meaning we'll find this file regardless of our current working directory:
file.exists( file.full )
# [1] TRUE
setwd( "~" )
file.exists( file.full )
# [1] TRUE
Now, even if the working directory isn't passed properly to each processing node, they can still find the files they need.
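As a side note (not part of the answer above, just a base-R convenience I find handy), normalizePath() converts relative paths into absolute ones, which gives the same safety without retyping the directory:

```r
# Work in a throwaway directory with one sample file, so the sketch is
# self-contained; in practice you'd already be in your data directory.
d <- tempdir()
writeLines("hello", file.path(d, "sample.txt"))
old_wd <- setwd(d)

file.names <- dir(pattern = "txt")       # relative names, e.g. "sample.txt"
file.full  <- normalizePath(file.names)  # resolved against the current wd

setwd(old_wd)                            # leave the directory again...
file.exists(file.full)                   # ...and the absolute path still works
```

Passing file.full to the workers then behaves the same as building the paths with dir(path = ..., full.names = TRUE).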
Upvotes: 1
Reputation: 311
rm(list=ls())
setwd("D:/Test")
library(foreach)
library(doParallel)
c1<-makeCluster(4, outfile = "debug.txt")
CE<-clusterEvalQ(c1, .libPaths(""))
registerDoParallel(c1)
print(paste0("Cores = ",detectCores()))
file.names <- dir("D:/Test",pattern ="h00|b00",full.names=TRUE)
output<-list()
output<-foreach (i=1:4) %dopar% {
read.table(file=file.names[i])
}
stopCluster(c1)
Upvotes: 1