On_an_island

Reputation: 509

R-automating web page text scrape

I'm trying to automate scraping text from a website using rvest, but I get the error below when I run a loop that reads web page URLs from the vector book.titles.urls. However, when I scrape the desired text from a single page (without the loop), it works just fine:

Working Code

library(rvest)
library(tidyverse)

#Paste URL to be read by read_html function
lex.url <- 'https://fab.lexile.com/search/results?keyword=The+True+Story+of+the+Three+Little+Pigs'
lex.webpage <- read_html(lex.url)

#Use CSS selectors to scrape lexile numbers and convert data to text
lex.num <- html_nodes(lex.webpage, '.results-lexile-code')
lex.num.txt <- html_text(lex.num[1])

> lex.num.txt
[1] "AD510L"

Reprex

library(rvest)
library(tidyverse)

book.titles <- c("The+True+Story+of+the+Three+Little+Pigs",
                 "The+Teacher+from+the+Black+Lagoon",
                 "A+Letter+to+Amy",
                 "The+Principal+from+the+Black+Lagoon",
                 "The+Art+Teacher+from+the+Black+Lagoon")
book.titles.urls <- paste0("https://fab.lexile.com/search/results?keyword=", book.titles)

out <- length(book.titles)
for (i in seq_along(book.titles.urls)) {
  node1 <- html_session(i)
  lex.url <- as.character(book.titles.urls[i])
  lex.webpage <- read_html(lex.url[i])
  lex.num <- html_nodes(node1, lex.webpage[i], '.results-lexile-code')
  lex.num.txt <- html_text(lex.num[i][1])
  out <- lex.num.txt[i]
}

Error code

Error in httr::handle(url) : is.character(url) is not TRUE

Upvotes: 2

Views: 188

Answers (1)

Dave2e

Reputation: 24149

The error occurs because you are passing an integer to the html_session function; the function expects a character string (i.e. a URL). I also do not believe it is necessary to create a session here; generally that function is only needed if you have to log into the web site with a user id and password.
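For reference, if you do want to keep html_session, the fix inside your loop is simply to pass the URL string rather than the loop index. This is an untested sketch using the book.titles.urls vector from your question; if html_nodes() does not accept the session object directly in your rvest version, call read_html() on the URL instead, as in the loop below:

# pass the URL (a character string) to html_session, not the integer i
sess <- html_session(book.titles.urls[i])
lex.num <- html_nodes(sess, '.results-lexile-code')
lex.num.txt <- html_text(lex.num[1])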

You can simplify your loop:

#output list
output <- list()
j <- 1   #index into the output list
for (i in book.titles.urls) {
  lex.num <- html_nodes(read_html(i), '.results-lexile-code')
  # process the returned list of nodes, lex.num, here
  output[[j]] <- html_text(lex.num)
  j <- j + 1
}
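If it helps to keep track of which result belongs to which book, you could also name the list elements after the titles once the loop finishes. This is a small untested addition, and it assumes the loop visits every URL so that output and book.titles end up the same length:

# label each result with the search term it came from
names(output) <- book.titles
output[["A+Letter+to+Amy"]]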

I have not tested this, but I will offer this warning: when scraping a web site, please make sure you agree to and abide by its terms of service.

Edit: Here is a further simplification using lapply, which returns a list of character vectors, one per URL, with the result of each call:

library(dplyr)
listofresults <- lapply(book.titles.urls, function(i) {
  read_html(i) %>%
    html_nodes('.results-lexile-code') %>%
    html_text()
})
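If you only want the first Lexile code from each results page (the equivalent of lex.num[1] in your working code), you can reduce the list to a simple character vector. This is an untested sketch; any page with no matches will give NA:

# take the first code from each page; pages with no matches return NA
first.codes <- sapply(listofresults, function(x) x[1])
first.codes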

Upvotes: 3
