I'm trying to combine RSelenium and rvest to scrape a long list of websites, because rvest alone always timed out.
Since rvest alone doesn't work, RSelenium could solve the problem by opening and closing each website on the list in a loop, but I'm afraid this approach might take a long time if the list of websites is very long.
I tried combining my previous code with a new loop over multiple websites using RSelenium, but it doesn't seem to work.
library(xml2)
library(dplyr)
library(readr)
library(RSelenium)
webpages <- data.frame(name = c("amazon", "apple", "usps", "yahoo", "bbc", "ted", "surveymonkey", "forbes", "imdb", "hp"),
url = c("http://www.amazon.com",
"http://www.apple.com",
"http://www.usps.com",
"http://www.yahoo.com",
"http://www.bbc.com",
"http://www.ted.com",
"http://www.surveymonkey.com",
"http://www.forbes.com",
"http://www.imdb.com",
"http://www.hp.com"))
driver <- rsDriver(browser = c("chrome"))
remDr <- driver[["client"]]
i <- 1
while(i <= 4){
url <- webpages$url[i]
remDr$navigate(url)
page_source <- remDr$getPageSource()
URL <- read_html(page_source)
results <- URL %>% html_nodes("head")
records <- vector("list", length = length(results))
for (i in seq_along(records)) {
title <- xml_contents(results[i] %>%
html_nodes("title"))[1] %>% html_text(trim = TRUE)
description <- results[i] %>%
html_nodes("meta[name=description]") %>% html_attr("content")
keywords <- results[i] %>%
html_nodes("meta[name=keywords]") %>% html_attr("content")
}
i <- i + 1
remDr$close()
return(data.frame(name = x['name'],
url = x['url'],
title = ifelse(length(title) > 0, title, NA),
description = ifelse(length(description) > 0, desc, NA),
keywords = ifelse(length(keywords) > 0, kw, NA)))
}
The error I'm getting right now is:
Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "list"
My desired result is something like:
url title description keywords
http://www.apple.com Apple website description keywords
http://www.amazon.com Amazon website description keywords
http://www.usps.com Usps website description keywords
http://www.yahoo.com Yahoo website description keywords
http://www.bbc.com Bbc website description keywords
http://www.ted.com Ted website description keywords
http://www.surveymonkey.com Survey Monkey website description keywords
http://www.forbes.com Forbes website description keywords
http://www.imdb.com Imdb website description keywords
http://www.hp.com Hp website description keywords
Upvotes: 3
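You just need to replace page_source with page_source[[1]]: remDr$getPageSource() returns a list, and read_html() expects the HTML string stored in its first element. Also be a little more careful about variable naming and use (e.g. the inner and outer loops shared the index i, and desc and kw were never defined). I would also recommend printing a message inside loops like these so you can follow the progress. Furthermore, if you remove remDr$close(), you avoid losing the connection to the browser between pages. Finally, you can store the results in a variable defined outside the loop: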
library(rvest)  # provides html_nodes(), html_text() and html_attr()
scrapped <- list()  # one data frame per site, filled inside the loop
i <- 1
while (i <= 4) {
  url <- webpages$url[i]
  print(paste("Accessing:", url))
  remDr$navigate(url)
  # getPageSource() returns a list; its first element is the HTML string
  page_source <- remDr$getPageSource()
  URL <- read_html(page_source[[1]])
  results <- URL %>% html_nodes("head")
  records <- vector("list", length = length(results))
  # ii, not i, so the inner loop does not overwrite the outer counter
  for (ii in seq_along(records)) {
    title <- xml_contents(results[ii] %>% html_nodes("title"))[1] %>%
      html_text(trim = TRUE)
    desc <- results[ii] %>%
      html_nodes("meta[name=description]") %>%
      html_attr("content")
    keywords <- results[ii] %>%
      html_nodes("meta[name=keywords]") %>%
      html_attr("content")
  }
  # remDr$close() is left out so the session stays open for the next URL
  scrapped[[i]] <- data.frame(name = webpages[i, 'name'],
                              url = webpages[i, 'url'],
                              title = ifelse(length(title) > 0, title, NA),
                              description = ifelse(length(desc) > 0, desc, NA),
                              keywords = ifelse(length(keywords) > 0, keywords, NA))
  i <- i + 1
}
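The effect of the [[1]] indexing can be checked without starting a browser. The following is a minimal sketch, assuming the page source arrives as a one-element list holding the HTML string; fake_source is a hypothetical stand-in for the value returned by remDr$getPageSource(), not real browser output.

```r
library(xml2)

# Hypothetical stand-in for remDr$getPageSource(): a one-element list
# whose first element is the page's HTML as a character string.
fake_source <- list(
  "<html><head><title>Example</title></head><body></body></html>"
)

# Passing the list itself fails, because read_html() has no method for lists.
# tryCatch() captures the error message instead of stopping the script.
err <- tryCatch(
  read_html(fake_source),
  error = function(e) conditionMessage(e)
)
print(err)

# Extracting the first element gives read_html() the string it expects.
doc <- read_html(fake_source[[1]])
page_title <- xml_text(xml_find_first(doc, "//title"))
print(page_title)
```

The loop above applies the same indexing to each real page and stores one row per site in scrapped.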
Output
do.call('rbind', scrapped)
# name url title
#1 amazon http://www.amazon.com Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more
#2 apple http://www.apple.com Apple
#3 usps http://www.usps.com Welcome | USPS
#4 yahoo http://www.yahoo.com Yahoo
# description
#1 Online shopping from the earth's biggest selection of books, magazines, music, DVDs, videos, electronics, computers, software, apparel & accessories, shoes, jewelry, tools & hardware, housewares, furniture, sporting goods, beauty & personal care, broadband & dsl, gourmet food & just about anything else.
#2 <NA>
#3 Welcome to USPS.com. Find information on our most convenient and affordable shipping and mailing services. Use our quick tools to find locations, calculate prices, look up a ZIP Code, and get Track & Confirm info.
#4 Las noticias, el correo electrónico y las búsquedas son tan solo el comienzo. Descubre algo nuevo todos los días en Yahoo.
# keywords
#1 Amazon, Amazon.com, Books, Online Shopping, Book Store, Magazine, Subscription, Music, CDs, DVDs, Videos, Electronics, Video Games, Computers, Cell Phones, Toys, Games, Apparel, Accessories, Shoes, Jewelry, Watches, Office Products, Sports & Outdoors, Sporting Goods, Baby Products, Health, Personal Care, Beauty, Home, Garden, Bed & Bath, Furniture, Tools, Hardware, Vacuums, Outdoor Living, Automotive Parts, Pet Supplies, Broadband, DSL
#2 <NA>
#3 Quick Tools, Shipping Services, Mailing Services, Village Post Office, Ship Online, Flat Rate, Postal Store, Ship a Package, Send Mail, Manage Your Mail, Business Solutions, Find Locations, Calculate a Price, Look Up a ZIP Code, Track Packages, Print a Label, Stamps
#4 yahoo, yahoo inicio, yahoo página de inicio, yahoo búsqueda, correo yahoo, yahoo messenger, yahoo juegos, noticias, finanzas, deportes, entretenimiento
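The <NA> entries for apple show why each field needs a fallback when a page has no matching tag. One possible way to keep that logic in one place is a small helper that turns a single HTML string into a one-row data frame. This is only a sketch: extract_meta(), first_or_na() and the pages list are hypothetical names, and the inline HTML strings stand in for what remDr$getPageSource() would return in a live session.

```r
library(xml2)
library(rvest)

# Hypothetical helper: return the first value of x, or NA if x is empty.
first_or_na <- function(x) {
  if (length(x) > 0 && !is.na(x[1])) x[1] else NA_character_
}

# Hypothetical helper: parse one HTML string and return a one-row data frame.
extract_meta <- function(name, url, html) {
  doc <- read_html(html)
  data.frame(
    name = name,
    url = url,
    title = first_or_na(
      html_text(html_nodes(doc, "head title"), trim = TRUE)
    ),
    description = first_or_na(
      html_attr(html_nodes(doc, "meta[name=description]"), "content")
    ),
    keywords = first_or_na(
      html_attr(html_nodes(doc, "meta[name=keywords]"), "content")
    ),
    stringsAsFactors = FALSE
  )
}

# Simulated input: each `source` mimics the one-element list that
# getPageSource() returns. The second page has no meta tags on purpose.
pages <- list(
  list(
    name = "site_a", url = "http://a.example",
    source = list(paste0(
      "<html><head><title> Site A </title>",
      "<meta name=\"description\" content=\"About A\">",
      "<meta name=\"keywords\" content=\"a, alpha\">",
      "</head><body></body></html>"
    ))
  ),
  list(
    name = "site_b", url = "http://b.example",
    source = list("<html><head><title>Site B</title></head><body></body></html>")
  )
)

# One row per page, then bind the rows into a single data frame.
rows <- lapply(pages, function(p) extract_meta(p$name, p$url, p$source[[1]]))
result <- do.call(rbind, rows)
print(result)
```

With a live session, the body of the while loop could call such a helper with page_source[[1]] and assign the returned row to scrapped[[i]].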
Upvotes: 1