Reputation: 73
I am trying to write code that visits each page and extracts information from it. Url <-
I have code that should output all the hrefs, but it doesn't work.
tagrecode <- readHTMLTable (" paintings-by-alphabet")
tabla <-
names (tabla) <- c("name", "desc", "cat", "updated")
res <- htmlParse (" alphabet")
enlaces <- getNodeSet (res, "//p[@class='pb5']/a/@href")
enlaces <- unlist(lapply(enlaces, as.character))
tabla$enlace <- paste(" alphabet")
lisurl <- tabla$enlace
fu1 <- function(url){
  pas1 <- htmlParse(url, useInternalNodes=T)
  pas2 <- xpathSApply(pas1, "//p[@class='pb5']/a/@href")
}
urldef <- lapply(lisurl,fu1)
Once I have the list of URLs for all pictures on this page, I want to go through pages 2 to 23 to collect the URLs of all the pictures.
The next step is to scrape information about every picture. I have working code for a single picture, and I need to build it into one general script.
url = ""
doc = htmlTreeParse(url, useInternalNodes=T)
pictureName <- xpathSApply(doc,"//h1[@itemprop='name']", xmlValue)
date <- xpathSApply(doc, "//span[@itemprop='dateCreated']", xmlValue)
author <- xpathSApply(doc, "//a[@itemprop='author']", xmlValue)
style <- xpathSApply(doc, "//span[@itemprop='style']", xmlValue)
genre <- xpathSApply(doc, "//span[@itemprop='genre']", xmlValue)
Any advice on how to do this would be appreciated!
Upvotes: 6
Views: 17923
Reputation: 380
You can try the Rcrawler package. It is a parallel web scraper that can crawl and store web pages and scrape their content using XPath.
If you need to collect information on all pictures, use:
Rcrawler(Website = "", no_cores = 4, no_conn = 4, ExtractPatterns =datapattern )
To keep only Claude Monet pictures:
Rcrawler(Website = "", no_cores = 4, no_conn = 4, urlregexfilter ="claude-monet/([^/])*", ExtractPatterns =datapattern )
The crawler will take some time to finish, as it traverses all of the website's links. However, you can stop the execution at any time. By default, the scraped data are stored in a global variable named DATA; another variable called INDEX contains all crawled URLs.
If you need to learn how to build your own crawler, refer to this paper: R crawler
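The calls above pass `datapattern`, which is not defined in the text shown here. A minimal sketch of what it might look like, assuming `ExtractPatterns` accepts a character vector of XPath expressions and reusing the XPaths from the question (the vector below is an illustration, not the answer's original definition):

```r
# Hypothetical definition of datapattern: one XPath expression per field
# to extract, taken from the XPaths used in the question.
datapattern <- c(
  "//h1[@itemprop='name']",
  "//span[@itemprop='dateCreated']",
  "//a[@itemprop='author']",
  "//span[@itemprop='style']",
  "//span[@itemprop='genre']"
)
```

Defining the patterns once and passing the same vector to both calls keeps the extracted columns consistent between the full crawl and the filtered one.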
Upvotes: 0
Reputation: 59355
This seems to work.
library(XML)
library(httr)
url <- ""
hrefs <- list()
for (i in 1:23) {
  # fetch listing page i and collect the picture links on it
  response <- GET(paste0(url,i))
  doc <- content(response,type="text/html")
  hrefs <- c(hrefs,doc["//p[@class='pb5']/a/@href"])
}
url <- ""
xPath <- c(pictureName = "//h1[@itemprop='name']",
date = "//span[@itemprop='dateCreated']",
author = "//a[@itemprop='author']",
style = "//span[@itemprop='style']",
genre = "//span[@itemprop='genre']")
get.picture <- function(href) {
  response <- GET(paste0(url,href))
  doc <- content(response,type="text/html")
  # NA when the node is missing, otherwise the text of the first match
  info <- sapply(xPath,function(xp)ifelse(length(doc[xp])==0,NA,xmlValue(doc[xp][[1]])))
}
pictures <- do.call(rbind,lapply(hrefs,get.picture))
# pictureName date author style genre
# [1,] "A Corner of the Garden at Montgeron" "1877" "Claude Monet" "Impressionism" "landscape"
# [2,] "A Corner of the Studio" "1861" "Claude Monet" "Realism" "self-portrait"
# [3,] "A Farmyard in Normandy" "c.1863" "Claude Monet" "Realism" "landscape"
# [4,] "A Windmill near Zaandam" NA "Claude Monet" "Impressionism" "landscape"
# [5,] "A Woman Reading" "1872" "Claude Monet" "Impressionism" "genre painting"
# [6,] "Adolphe Monet Reading in the Garden" "1866" "Claude Monet" "Impressionism" "genre painting"
You were actually pretty close. Your xPath is fine; one problem is that not all of the pictures have all of the information (i.e., for some of the pages the node sets you are trying to access are empty). Note the date for "A Windmill near Zaandam". So the code has to handle this possibility.
So in this example, the first loop grabs the values of the href attribute of the anchor tags for each page (1:23) and combines these into a vector of length ~1300.
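The collection step can be tried offline. The following is a minimal sketch, assuming the XML package is installed; the two HTML strings and their href values are made-up stand-ins for downloaded listing pages, and `get_hrefs` is a hypothetical helper name:

```r
library(XML)

# Hypothetical stand-ins for two downloaded listing pages
listing_pages <- c(
  "<html><body><p class='pb5'><a href='/en/artist/painting-a'>A</a></p><p class='pb5'><a href='/en/artist/painting-b'>B</a></p></body></html>",
  "<html><body><p class='pb5'><a href='/en/artist/painting-c'>C</a></p></body></html>"
)

# Return the href values of the picture links on one page
get_hrefs <- function(html) {
  doc <- htmlParse(html, asText = TRUE)
  as.character(xpathSApply(doc, "//p[@class='pb5']/a/@href"))
}

# Apply to every page and flatten into a single character vector
hrefs <- unlist(lapply(listing_pages, get_hrefs))
```

Using `lapply` plus `unlist` instead of growing a list inside a `for` loop gives the same combined result; which one to use is mostly a matter of taste at this scale.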
To process each of these 1300 pages, and since we have to deal with missing tags, it's more straightforward to create a vector containing the xPath strings and apply that element-wise to each page. That's what the function get.picture(...) does. The last statement calls this function with each of the 1300 hrefs and binds the results together row-wise, using do.call(rbind,...).
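The missing-tag handling and the row-wise binding can be sketched on small inline documents. This assumes the XML package is installed; the HTML strings are made-up stand-ins for picture pages, and `first_or_na` and `get_info` are hypothetical helper names:

```r
library(XML)

# Hypothetical stand-ins for two picture pages; the second has no date
pages <- c(
  "<html><body><h1 itemprop='name'>A Woman Reading</h1><span itemprop='dateCreated'>1872</span></body></html>",
  "<html><body><h1 itemprop='name'>A Windmill near Zaandam</h1></body></html>"
)

xPath <- c(pictureName = "//h1[@itemprop='name']",
           date        = "//span[@itemprop='dateCreated']")

# Text of the first matching node, or NA if nothing matches
first_or_na <- function(doc, xp) {
  nodes <- getNodeSet(doc, xp)
  if (length(nodes) == 0) NA_character_ else xmlValue(nodes[[1]])
}

# Apply every XPath to one page; names come from the xPath vector
get_info <- function(html) {
  doc <- htmlParse(html, asText = TRUE)
  sapply(xPath, function(xp) first_or_na(doc, xp))
}

# One named vector per page, bound row-wise into a character matrix
pictures <- do.call(rbind, lapply(pages, get_info))
```

Because every page yields a vector of the same length, with NA filling the gaps, `rbind` can stack them without misaligning columns.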
Note also that this code uses the somewhat more compact indexing feature for objects of class HTMLInternalDocument: doc[xpath], where xpath is an xPath string. This avoids the use of xpathSApply(...), although the latter would have worked.
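A small sketch comparing the two forms on an inline document, assuming the XML package is installed (the HTML string is a made-up example):

```r
library(XML)

# Hypothetical inline document with two matching nodes
html <- "<html><body><span itemprop='genre'>landscape</span><span itemprop='genre'>marina</span></body></html>"
doc  <- htmlParse(html, asText = TRUE)
xp   <- "//span[@itemprop='genre']"

# Indexing the document with an XPath string returns the matching nodes
via_index  <- sapply(doc[xp], xmlValue)

# xpathSApply finds the nodes and applies xmlValue in one call
via_sapply <- xpathSApply(doc, xp, xmlValue)
```

Both approaches select the same nodes here; the indexing form is only shorter to type. Note that it relies on the document being an XML-package object, so it may not apply if the parser in use returns a different document class.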
Upvotes: 9