cheklapkok

Reputation: 439

How to loop through multiple websites and extract the same information using RSelenium and rvest in R?

I'm trying to write code that combines RSelenium and rvest, because rvest alone always times out when scraping a long list of websites.

Since rvest alone doesn't work, RSelenium could solve the problem by opening and closing each website in the list in a loop, but I'm afraid this approach might take a long time if the list of websites is very long.
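For reference, a rvest-only loop looks roughly like this (a minimal sketch; the tryCatch wrapper and the 10-second timeout via httr::GET are illustrative, not my exact code):

```r
library(rvest)
library(httr)

# Illustrative rvest-only attempt: wrap each request in tryCatch so one
# slow or unreachable site doesn't abort the whole run (timeout value
# is just an example)
titles <- sapply(webpages$url, function(u) {
  tryCatch({
    page <- read_html(GET(u, timeout(10)))
    page %>% html_nodes("title") %>% html_text(trim = TRUE)
  }, error = function(e) NA_character_)
})
```

Even with a timeout, many sites return nothing useful this way, which is why I'm looking at RSelenium.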

I tried combining my previous code and adding a loop over multiple websites with RSelenium, but it doesn't seem to be working.

library(xml2)
library(rvest)
library(dplyr)
library(readr)
library(RSelenium)
webpages <- data.frame(name = c("amazon", "apple", "usps", "yahoo", "bbc", "ted", "surveymonkey", "forbes", "imdb", "hp"),
                       url = c("http://www.amazon.com",
                               "http://www.apple.com",
                               "http://www.usps.com",
                               "http://www.yahoo.com",
                               "http://www.bbc.com",
                               "http://www.ted.com",
                               "http://www.surveymonkey.com",
                               "http://www.forbes.com",
                               "http://www.imdb.com",
                               "http://www.hp.com"))

driver <- rsDriver(browser = c("chrome"))
remDr <- driver[["client"]]

i <- 1
while(i <= 4){
  url <- webpages$url[i]
  remDr$navigate(url)

  page_source <- remDr$getPageSource()

  URL <- read_html(page_source)

  results <- URL %>% html_nodes("head")

  records <- vector("list", length = length(results))

  for (i in seq_along(records)) {
  title <- xml_contents(results[i] %>% 
    html_nodes("title"))[1] %>% html_text(trim = TRUE)
  description <- results[i] %>% 
    html_nodes("meta[name=description]") %>% html_attr("content")
  keywords <- results[i] %>%
    html_nodes("meta[name=keywords]") %>% html_attr("content")
}

  i <- i + 1
  remDr$close()

  return(data.frame(name = x['name'],
                    url = x['url'],
                    title = ifelse(length(title) > 0, title, NA),
                    description = ifelse(length(description) > 0, desc, NA),
                    keywords = ifelse(length(keywords) > 0, kw, NA)))

}

The error I'm getting right now is:

Error in UseMethod("read_xml") : 
  no applicable method for 'read_xml' applied to an object of class "list"

My desired result is something like:

url                            title                 description               keywords
http://www.apple.com           Apple             website description        keywords
http://www.amazon.com          Amazon            website description        keywords
http://www.usps.com            Usps              website description        keywords
http://www.yahoo.com           Yahoo             website description        keywords   
http://www.bbc.com             Bbc               website description        keywords
http://www.ted.com             Ted               website description        keywords
http://www.surveymonkey.com    Survey Monkey     website description        keywords
http://www.forbes.com          Forbes            website description        keywords
http://www.imdb.com            Imdb              website description        keywords
http://www.hp.com              Hp                website description        keywords


Upvotes: 3

Views: 851

Answers (1)

Ulises Rosas-Puchuri

Reputation: 1970

You just need to change page_source to page_source[[1]] and be a little more careful with variable naming (e.g. indexers, vectors) and how variables are referenced. I would also recommend printing a message inside loops like these. Furthermore, if you move remDr$close() out of the loop, you avoid losing the connection between iterations. Finally, you can store the results in a variable outside the loop:

scrapped = list()

i <- 1
while(i <= 4){

  url <- webpages$url[i]

  print( paste("Accessing to:", url) )

  remDr$navigate(url)

  page_source <- remDr$getPageSource()

  URL <- read_html(page_source[[1]])

  results <- URL %>% html_nodes("head")

  records <- vector("list", length = length(results))

  for (ii in seq_along(records)) {

     title <- xml_contents(results[ii] %>%  html_nodes("title"))[1] %>%
      html_text(trim = TRUE)

     desc <- results[ii] %>% 
      html_nodes("meta[name=description]") %>% 
      html_attr("content")

    keywords <- results[ii] %>%
      html_nodes("meta[name=keywords]") %>% 
      html_attr("content")
  }

  #remDr$close()

  scrapped[[i]] = data.frame(name = webpages[i, 'name'],
                             url = webpages[i, 'url'],
                             title = ifelse(length(title) > 0, title, NA),
                             description = ifelse(length(desc) > 0, desc, NA),
                             keywords = ifelse(length(keywords) > 0, keywords, NA))
  i = i + 1

}

Output

do.call('rbind', scrapped) 

#    name                   url                                                                               title
#1 amazon http://www.amazon.com Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more
#2  apple  http://www.apple.com                                                                               Apple
#3   usps   http://www.usps.com                                                                      Welcome | USPS
#4  yahoo  http://www.yahoo.com                                                                               Yahoo
                                                                                                                                                                                                                                                                                                   description
#1 Online shopping from the earth's biggest selection of books, magazines, music, DVDs, videos, electronics, computers, software, apparel & accessories, shoes, jewelry, tools & hardware, housewares, furniture, sporting goods, beauty & personal care, broadband & dsl, gourmet food & just about anything else.
#2                                                                                                                                                                                                                                                                                                             <NA>
#3                                                                                            Welcome to USPS.com. Find information on our most convenient and affordable shipping and mailing services. Use our quick tools to find locations, calculate prices, look up a ZIP Code, and get Track & Confirm info.
#4                                                                                                                                                                                       Las noticias, el correo electrónico y las búsquedas son tan solo el comienzo. Descubre algo nuevo todos los días en Yahoo.

#keywords
#1 Amazon, Amazon.com, Books, Online Shopping, Book Store, Magazine, Subscription, Music, CDs, DVDs, Videos, Electronics, Video Games, Computers, Cell Phones, Toys, Games, Apparel, Accessories, Shoes, Jewelry, Watches, Office Products, Sports & Outdoors, Sporting Goods, Baby Products, Health, Personal Care, Beauty, Home, Garden, Bed & Bath, Furniture, Tools, Hardware, Vacuums, Outdoor Living, Automotive Parts, Pet Supplies, Broadband, DSL
#2                                                                                                                                                                                                                                                                                                                                                                                                                                                    <NA>
#3                                                                                                                                                                             Quick Tools, Shipping Services, Mailing Services, Village Post Office, Ship Online, Flat Rate, Postal Store, Ship a Package, Send Mail, Manage Your Mail,  Business Solutions, Find Locations, Calculate a Price, Look Up a ZIP Code, Track Packages, Print a Label, Stamps
#4                                                                                                                                                                                                                                                                                                 yahoo, yahoo inicio, yahoo página de inicio, yahoo búsqueda, correo yahoo, yahoo messenger, yahoo juegos, noticias, finanzas, deportes, entretenimiento
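Once the loop finishes, you can shut the session down a single time rather than inside the loop; a short sketch:

```r
# Close the browser and stop the Selenium server once, after the loop
remDr$close()
driver$server$stop()
```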

Upvotes: 1
