wwnpo01

Reputation: 77

Maximizing speed of a loop/apply function

I am struggling with a huge data set at the moment. What I would like to do is not very complicated, but it is just too slow. In the first step, I need to check whether a website is active or not. For this purpose, I used the following code (here with a sample of three API paths):

library(httr)

# TRUE if the site responds with an HTTP error status (>= 400)
Updated <- function(x) { http_error(GET(x)) }

websites <- data.frame(url = c("https://api.crunchbase.com/v3.1/organizations/designpitara",
                               "www.twitter.com",
                               "www.sportschau.de"))
abc <- apply(websites, 1, Updated)

I have already noticed that a for loop is much faster than the apply function. However, the full code (which has around 1 million APIs to check) would still take around 55 hours to execute. Any help is appreciated :)
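
One caveat with the check itself: GET() throws an error if a host cannot be reached at all, which would abort a long run. A hedged variant of Updated (assuming an unreachable site should count as an error, i.e. return TRUE) traps those failures and caps each request with a timeout:

# Sketch: treat connection failures as errors (an assumption about the
# desired semantics) and cap each request at 10 seconds so dead hosts
# fail fast instead of hanging the loop.
Updated <- function(x) {
  tryCatch(http_error(GET(x, timeout(10))), error = function(e) TRUE)
}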

Upvotes: 0

Views: 246

Answers (2)

Feakster

Reputation: 556

Alternatively, something like this would work for loading multiple packages on the PSOCK cluster:

clusterEvalQ(cl, {
  library(data.table)
  library(survival)
})
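
For context, here is a sketch of where that call sits in the wider PSOCK workflow (the packages are just the placeholders from above; swap in whatever your function needs):

library(parallel)

cl <- makeCluster(detectCores(), type = "PSOCK")
clusterEvalQ(cl, {   # load the packages on every worker
  library(data.table)
  library(survival)
})
# ... parLapply()/parSapply() calls that rely on those packages ...
stopCluster(cl)      # always release the workers when done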

Upvotes: 1

Feakster

Reputation: 556

The primary limiting factor will probably be the time taken to query the website. Currently, you're waiting for each query to return a result before executing the next one. The best way to speed up the workflow would be to execute batches of queries in parallel.

If you're using a Unix system you could try the following:

### Packages ###
library(parallel)

### On your example ###
abc <- unlist(mclapply(websites[[1]], Updated, mc.cores = 3))

### On a larger number of sites ###
abc <- unlist(mclapply(websites[[1]], Updated, mc.cores = detectCores()))

### You can even go beyond your machine's core count ###
abc <- unlist(mclapply(websites[[1]], Updated, mc.cores = 40))

However, the precise number of workers at which you saturate your processor or your internet connection depends on your machine and your network.
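
If you want to find that saturation point empirically, one rough approach (assuming a small sample of your URLs is representative of the rest) is to time a few worker counts on a subset:

# Sketch: time a subset at several worker counts (sizes are illustrative)
sample_urls <- websites[[1]][1:100]
for (n in c(4, 8, 16, 32)) {
  elapsed <- system.time(
    unlist(mclapply(sample_urls, Updated, mc.cores = n))
  )["elapsed"]
  cat(n, "workers:", elapsed, "seconds\n")
}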

Alternatively, if you're stuck on Windows:

### For a larger number of sites ###
cl <- makeCluster(detectCores(), type = "PSOCK")
clusterExport(cl, varlist = "websites")
clusterEvalQ(cl = cl, library(httr))
abc <- parSapply(cl = cl, X = websites[[1]], FUN = Updated, USE.NAMES = FALSE)
stopCluster(cl)

In the case of PSOCK clusters, I'm not sure whether there are any benefits to be had from exceeding your machine's core count, although I'm not a Windows person, and I welcome any correction.
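
Since the job is I/O-bound rather than CPU-bound, an event-driven HTTP client may also be worth a look, instead of or alongside worker processes. Here is a rough sketch with the curl package's multi interface (assumptions on my part: a HEAD-style request via the nobody option and a 10-second timeout are acceptable for your APIs); it fires many requests concurrently from a single R process:

library(curl)

urls <- websites[[1]]
is_error <- logical(length(urls))  # TRUE mirrors http_error()'s meaning

for (i in seq_along(urls)) {
  local({
    j <- i  # freeze the index for the callbacks
    h <- new_handle(url = urls[j], nobody = TRUE, timeout = 10)
    multi_add(h,
              done = function(res) is_error[j] <<- res$status_code >= 400,
              fail = function(msg) is_error[j] <<- TRUE)
  })
}
multi_run()  # runs the whole pool concurrently, blocking until finished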

Upvotes: 1
