Anne Boysen

Reputation: 105

Scrape multiple URLs with rvest

How can I scrape multiple URLs using read_html in rvest? The goal is to obtain a single document consisting of the text bodies from the respective URLs, on which to run various analyses.

I tried to concatenate the urls:

 url <- c("https://www.vox.com/","https://www.cnn.com/")
   page <-read_html(url)
   page
   story <- page %>%
        html_nodes("p") %>%  
        html_text

After read_html I get an error:

 Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) : 
 Expecting a single string value: [type=character; extent=3].

Not surprising, since read_html probably handles only one path at a time. However, can I use a different function or transformation so that several pages can be scraped simultaneously?

Upvotes: 1

Views: 624

Answers (1)

Maurits Evers

Reputation: 50668

You can use map (or in base R: lapply) to loop through every URL element; here is an example:

url <- c("https://www.vox.com/", "https://www.bbc.com/")
page <-map(url, ~read_html(.x) %>% html_nodes("p") %>% html_text())
str(page)
#List of 2
# $ : chr [1:22] "But he was acquitted on the two most serious charges he faced." "Health experts say it’s time to prepare for worldwide spread on all continents." "Wall Street is waking up to the threat of coronavirus as fears about the disease and its potential global econo"| __truncated__ "Johnson, who died Monday at age 101, did groundbreaking work in helping return astronauts safely to Earth." ...
# $ : chr [1:19] "" "\n                                                            The ex-movie mogul is handcuffed and led from cou"| __truncated__ "" "27°C" ...

The return object is a list, with one character vector of paragraph texts per URL.
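Since the question asks for a single document to analyse, you could then collapse that list into one string; a minimal sketch (single_doc is just an illustrative name):

# Flatten the per-URL character vectors and join them into one text body
single_doc <- paste(unlist(page), collapse = "\n")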

PS: I've changed the second URL element because "https://www.cnn.com/" returned NULL for html_nodes("p") %>% html_text().
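For completeness, here is a sketch of the base R lapply version mentioned above; it assumes rvest is loaded (which also provides the %>% pipe) and returns the same kind of list:

# Same loop with base R's lapply instead of purrr::map
page <- lapply(url, function(x) {
  read_html(x) %>% html_nodes("p") %>% html_text()
})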

Upvotes: 1
