rrodrigorn0

Reputation: 195

R: Introducing time intervals when scraping

I'm trying to scrape some websites using "RSelenium". However, it seems that the websites detect my scraping attempts. Would it be possible to introduce some time gaps between each scrape? My code is this:

library('XML')
library('RSelenium')
checkForServer() # search for and download Selenium Server java binary.  Only need to run once.
startServer() # run Selenium Server binary
remDr <- remoteDriver(browserName="firefox", port=4444) # instantiate remote driver to connect to Selenium Server
remDr$open(silent=T) # open web browser

page_sub = read.csv("indigogo_edu_us.csv")

url_list = as.vector(page_sub$full_url[1:3])  

scrape = function(url_list){  

  remDr$navigate(url_list) # navigates to webpage

  elem <- remDr$findElement(using="class", value="i-description") 
  elemtxt <- elem$getElementAttribute("outerHTML")[[1]] 
  elemxml <- htmlTreeParse(elemtxt, useInternalNodes=T)  

  fundList <- unlist(xpathApply(elemxml, '//input[@title]', xmlGetAttr, 'title')) # parses out just the fund name and ticker using XPath
  page = as.data.frame(xpathSApply(  elemxml,'//div[@class="i-description"]', xmlValue, encoding="UTF-8"))
  names(page)[1] = "description"
  page # return the data frame; otherwise the function returns the value of the assignment above
}
cc = lapply(url_list, scrape)

Upvotes: 0

Views: 603

Answers (1)

Roman Luštrik

Reputation: 70653

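Yes, use Sys.sleep() to pause between requests. You can also draw the pause length from a random number generator, so the requests don't arrive at a fixed interval.

Something along the lines of the following, which pauses for a random duration between 3 and 11 seconds: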

Sys.sleep(runif(1, min = 3, max = 11))
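
As a rough sketch of how that pause could be combined with the lapply() call from the question: the helper with_random_delay() and the stand-in fake_scrape() below are hypothetical names made up for illustration, not part of RSelenium, and the short delays are only there so the example finishes quickly.

```r
# Hypothetical helper (not part of RSelenium): wraps a function so that every
# call is preceded by a pause of random length.
with_random_delay <- function(f, min_sec = 3, max_sec = 11) {
  function(...) {
    Sys.sleep(runif(1, min = min_sec, max = max_sec)) # random pause, in seconds
    f(...)
  }
}

# Stand-in for the real scrape() function; it only echoes the URL so the
# sketch runs without a Selenium server.
fake_scrape <- function(url) paste("scraped", url)

# Placeholder URLs, assumed for illustration only.
urls <- c("http://example.com/a", "http://example.com/b")

# Short delays here so the example runs quickly; the defaults are 3 to 11 s.
polite_scrape <- with_random_delay(fake_scrape, min_sec = 0.1, max_sec = 0.3)
results <- lapply(urls, polite_scrape)
```

With the code from the question, the same idea would be cc = lapply(url_list, with_random_delay(scrape)). Simply putting the Sys.sleep() line at the top of scrape() should work just as well; the wrapper only keeps the delay separate from the scraping logic.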

Upvotes: 2
