Reputation: 284
I have a really simple question (I think) but I can't find an answer anywhere on Stack Overflow. I have written a loop that takes repec_id entries for academic papers from a large dataset (150,000 entries) and pulls the reference list for each paper from a database called RePEc. It looks like this:
library(rvest)  # for read_html(), html_nodes(), html_attr()

url_base <- "http://citec.repec.org/api/amf/"

## pre-allocate the results list, then loop over every ID
references_1 <- vector("list", length = length(df$repec_id))
for (i in seq_along(df$repec_id)) {
  try({
    get_data <- read_html(paste0(url_base, df$repec_id[i], usercode))
    get_references <- html_nodes(get_data, "references") %>%
      html_nodes("text") %>%
      html_attr("ref")
    references_1[[i]] <- paste(get_references, collapse = " ")
    print(i)  # track progress
  })
}
For the sake of speed, I want to run the loop 5 times, analysing 30,000 IDs each time (e.g. IDs 1-30,000, then IDs 30,001-60,000, then IDs 60,001-90,000, and so on), and then combine these into a single list (references_1). Does anyone know how I can do this?
Unfortunately, the usercode only works on my IP, so this example isn't reproducible, but I think (hope) my question doesn't rely on reproducibility... Thank you in advance for your help!
Upvotes: 0
Views: 55
Reputation: 523
To break this up, instead of using seq_along, one option is to specify a range of i's to loop over for each of the 5 times you want to run this.
start <- 1
for (i in start:min(start + 29999, length(df$repec_id))) {
  ...
}
That should take whatever you set as your starting value and loop through a total of 30,000 iterations from there, unless 30,000 would put you past the length of df$repec_id, which is why the min is there.
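For completeness, here is a minimal sketch of the whole loop with this chunking applied, assuming the df, usercode, and rvest calls from your question. Because references_1 is indexed by i, every chunk writes into the same pre-allocated list, so no separate combining step is needed afterwards:

library(rvest)

url_base <- "http://citec.repec.org/api/amf/"
chunk_size <- 30000
n <- length(df$repec_id)

references_1 <- vector("list", length = n)
for (start in seq(1, n, by = chunk_size)) {
  # each pass of the outer loop handles one chunk of up to 30,000 IDs
  for (i in start:min(start + chunk_size - 1, n)) {
    try({
      get_data <- read_html(paste0(url_base, df$repec_id[i], usercode))
      get_references <- html_nodes(get_data, "references") %>%
        html_nodes("text") %>%
        html_attr("ref")
      references_1[[i]] <- paste(get_references, collapse = " ")
      print(i)  # progress indicator
    })
  }
}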
That said, I'm not sure this will actually speed things up, unless your concern is that you want to break up the process so you're not just letting it run indefinitely. (If that's the case, I usually just include print(i) in my loop to track progress.)
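If you do run the chunks separately and end up with five distinct lists (the names below are hypothetical), combining them back into a single list is just concatenation, since c() on lists returns one longer list:

references_1 <- c(references_chunk1, references_chunk2, references_chunk3,
                  references_chunk4, references_chunk5)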
Upvotes: 1