Reputation: 251
I would like to use RCurl as a polite web crawler to download data from a website. I need the data for scientific research. Although I have the right to access the content of the website via my university, the site's terms of use forbid web crawlers.
I asked the site's administrators directly for the data, but they only replied vaguely; in any case, it seems they won't simply send me the underlying databases.
What I want to do now is officially ask them for one-time permission to download specific text-only content from their site, using an R script based on RCurl that waits three seconds after each request.
The addresses of the pages I want to download data from look like this: http://plants.jstor.org/specimen/<ID of the site>
I tried to program this with RCurl but cannot get it to work. A few things complicate matters:
One can only access the website if cookies are allowed (I got that working in RCurl with the cookiefile argument).
The Next button only appears in the page source once one has actually reached the page by clicking through the links in a normal browser. In the source code, the Next button is encoded by an expression like
<a href="/.../***ID of next site***">Next > > </a>
When one tries to access a page directly (without having clicked through to it in the same browser session before), it won't work: the line with the link is simply missing from the source code.
The IDs of the sites are combinations of letters and digits (like “goe0003746” or “cord00002203”), so I can't simply write a for-loop in R that tries every number from 1 to 1,000,000.
So my program is supposed to mimic a person who clicks through all the pages via the Next button, saving the textual content each time.
After saving the content of a page, it should wait three seconds before clicking the Next button (it must be a polite crawler). I got that working in R as well, using Sys.sleep.
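Here is a rough sketch of the pieces I already have working (the cookie file name, the example specimen ID and the user agent string are just placeholders):
library(RCurl)
## curl handle that accepts and stores cookies (placeholder cookie file)
curl <- getCurlHandle(cookiefile = 'cookies.txt',
                      followlocation = TRUE,
                      useragent = 'R (RCurl) polite crawler for research')
## fetch one specimen page (example ID from above)
page <- getURL('http://plants.jstor.org/specimen/goe0003746', curl = curl)
## be polite: wait three seconds before the next request
Sys.sleep(3)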
I also thought of using an existing automated download tool, but there seem to be a lot of such programs and I don't know which one to use.
I'm also not much of a programmer (apart from a little R), so I would really appreciate a solution that doesn't involve programming in Python, C++, PHP or the like.
Any thoughts would be much appreciated! Thank you very much in advance for comments and proposals!
Upvotes: 3
Views: 1198
Reputation: 1586
Try a different strategy.
##########################
####
#### Scrape http://plants.jstor.org/specimen/
#### Idea:: Gather links from http://plants.jstor.org/search?t=2076
#### Then follow links:
####
#########################
library(RCurl)
library(XML)
### get the search page
cookie = 'cookiefile.txt'
curl = getCurlHandle(cookiefile = cookie,
                     useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6",
                     header = FALSE,
                     verbose = TRUE,
                     netrc = TRUE,
                     maxredirs = as.integer(20),
                     followlocation = TRUE)
querry.jstor <- getURL('http://plants.jstor.org/search?t=2076', curl = curl)
## remove white spaces:
querry.jstor2 <- gsub('\r','', gsub('\t','', gsub('\n','', querry.jstor)))
### get links from search page
getLinks = function() {
  links = character()
  list(a = function(node, ...) {
         links <<- c(links, xmlGetAttr(node, "href"))
         node
       },
       links = function() links)
}
## retrieve the links (with handlers supplied, the parse returns the handler closures)
h1 <- getLinks()
querry.jstor.xml.parsed <- htmlTreeParse(querry.jstor2, useInt = T, handlers = h1)
## clean up the links to keep only the ones we want
querry.jstor.links <- querry.jstor.xml.parsed$links()
querry.jstor.links <- querry.jstor.links[!grepl('http', querry.jstor.links)]       ## drop absolute http links
querry.jstor.links <- querry.jstor.links[!grepl('search', querry.jstor.links)]     ## drop search links
querry.jstor.links <- querry.jstor.links[!grepl('#', querry.jstor.links)]          ## drop # links
querry.jstor.links <- querry.jstor.links[!grepl('javascript', querry.jstor.links)] ## drop javascript links
querry.jstor.links <- querry.jstor.links[!grepl('action', querry.jstor.links)]     ## drop action links
querry.jstor.links <- querry.jstor.links[!grepl('page', querry.jstor.links)]       ## drop page links
## number of results
jstor.article <- getNodeSet(htmlTreeParse(querry.jstor2, useInt=T), "//article")
## the first //article node begins with something like "1,234 Results"; pull out the leading number
NumOfRes <- strsplit(gsub(',', '', gsub(' ', '', xmlValue(jstor.article[[1]][[1]]))), split = '')[[1]]
NumOfRes <- as.numeric(paste(NumOfRes[1:(min(grep('R', NumOfRes)) - 1)], collapse = ''))
## loop over the remaining result pages (20 results per page)
for(i in 2:ceiling(NumOfRes/20)){
  querry.jstor <- getURL(paste('http://plants.jstor.org/search?t=2076&p=', i, sep = ''), curl = curl)
  ## remove white space:
  querry.jstor2 <- gsub('\r', '', gsub('\t', '', gsub('\n', '', querry.jstor)))
  h1 <- getLinks()   ## fresh handler so links do not accumulate across pages
  querry.jstor.xml.parsed <- htmlTreeParse(querry.jstor2, useInt = T, handlers = h1)
  new.links <- querry.jstor.xml.parsed$links()
  new.links <- new.links[!grepl('http', new.links)]       ## drop absolute http links
  new.links <- new.links[!grepl('search', new.links)]     ## drop search links
  new.links <- new.links[!grepl('#', new.links)]          ## drop # links
  new.links <- new.links[!grepl('javascript', new.links)] ## drop javascript links
  new.links <- new.links[!grepl('action', new.links)]     ## drop action links
  new.links <- new.links[!grepl('page', new.links)]       ## drop page links
  querry.jstor.links <- c(querry.jstor.links, new.links)
  Sys.sleep(abs(rnorm(1, mean = 3.0, sd = 0.5)))          ## polite, slightly randomized pause
}
## make directory for saving data:
dir.create('./jstorQuery/')
## Now we have all the links, so we can retrieve all the info
for(j in 1:length(querry.jstor.links)){
  if(nchar(querry.jstor.links[j]) != 1){   ## skip bare '/' links
    querry.jstor <- getURL(paste('http://plants.jstor.org', querry.jstor.links[j], sep = ''), curl = curl)
    ## remove white space:
    querry.jstor2 <- gsub('\r', '', gsub('\t', '', gsub('\n', '', querry.jstor)))
    ## construct a file name from the last part of the link (the specimen ID):
    filename <- basename(querry.jstor.links[j])
    ## save in the directory:
    write(querry.jstor2, file = paste('./jstorQuery/', filename, '.html', sep = ''))
    Sys.sleep(abs(rnorm(1, mean = 3.0, sd = 0.5)))   ## polite, slightly randomized pause
  }
}
Upvotes: 2
Reputation: 60756
I may be missing exactly the bit you are hung up on, but it sounds like you are almost there.
It seems you can request page 1 with cookies enabled, parse the content for the next site ID, build the URL with that ID and request that page, then scrape whatever data you want.
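For example, a rough, untested sketch of that loop might look like the following (the starting ID, the cookie file name, and the XPath used to find the Next link are assumptions based on the snippet in your question, not something I have run against the site):
library(RCurl)
library(XML)
base <- 'http://plants.jstor.org'
url  <- paste(base, '/specimen/goe0003746', sep = '')   ## start page (example ID)
curl <- getCurlHandle(cookiefile = 'cookies.txt', followlocation = TRUE)
repeat {
  page <- getURL(url, curl = curl)
  write(page, file = paste(basename(url), '.html', sep = ''))
  ## look for the 'Next >>' link in the parsed HTML (the XPath is a guess)
  doc <- htmlParse(page, asText = TRUE)
  nxt <- xpathSApply(doc, "//a[contains(., 'Next')]/@href")
  free(doc)
  if (length(nxt) == 0) break          ## no Next link, so we are done
  url <- paste(base, nxt[1], sep = '')
  Sys.sleep(3)                         ## polite three-second pause
}
The advantage of following the site's own Next link is that you never have to guess IDs at all.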
It sounds like you have code that does almost all of this. Is the problem parsing page 1 to get the ID for the next step? If so, you should formulate a reproducible example and I suspect you'll get a very fast answer to your syntax problems.
If you're having trouble seeing what the site is doing, I recommend the Tamper Data plug-in for Firefox. It lets you see what request is being made at each mouse click. I find it really useful for this type of thing.
Upvotes: 1