I'm attempting to create a tidy data frame of news releases from several websites. Most of the sites are structured so that there is a main page of headlines with short blurbs, each linking to the full article. I'd like to scrape all of the full articles from the main page. Here's my approach. Any help would be much appreciated.
library(tidyverse)
library(rvest)
library(xml2)

url_1 <- read_html("http://lifepointhealth.net/news")

## seems to grab the lists
url_1 %>%
  html_nodes("li") %>%
  html_text() %>%
  str_squish() %>%
  str_trim() %>%
  enframe()
# A tibble: 80 x 2
name value
<int> <chr>
1 1 Who We Are Our Company Mis…
2 2 Our Company Mission, Visio…
3 3 Mission, Vision, Values an…
4 4 Giving Quality a Voice
5 5 How We Operate
6 6 Leadership
7 7 Awards
8 8 20th Anniversary
9 9 Our Communities Explore Ou…
10 10 Explore Our Communities
# … with 70 more rows
# this grabs the titles but there should be many more
url_1 %>%
  html_nodes("li .title") %>%
  html_text() %>%
  str_squish() %>%
  str_trim() %>%
  enframe()
# A tibble: 20 x 2
name value
<int> <chr>
1 1 LifePoint Health Names Elm…
2 2 David Steitz Named Chief E…
3 3 LifePoint Health Receives …
4 4 Thousands of Top U.S. Hosp…
5 5 Conemaugh Nason Medical Ce…
6 6 Vicki Parks Named CEO of W…
7 7 LifePoint Health Honors Ka…
8 8 Ennis Regional Medical Cen…
9 9 LifePoint Health Business …
10 10 LifePoint Health and R1 RC…
Watch the Network tab of the browser's dev tools and you will see that the page sends a request to http://lifepointhealth.net/api/posts each time you click "Load more". Imitate that request as below and you will be able to scrape all 332 post details:
items <- httr::POST(
  "http://lifepointhealth.net/api/posts",
  config = httr::add_headers(`Content-Type` = "application/x-www-form-urlencoded"),
  body = "skip=0&take=332&Type=News&tagFilter=",
  encode = "multipart"
) %>%
  httr::content() %>%
  .$Items

# each item is a list with some NULL fields; drop the NULLs so
# as.data.frame() works, then row-bind the items together
items <- dplyr::bind_rows(lapply(items, function(f) {
  as.data.frame(Filter(Negate(is.null), f))
}))
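From there, converting items to a tibble gives the tidy data frame the question asks for. A minimal sketch: the API's field names aren't shown above, so Title and Date below are assumptions, and any_of() keeps the selection from failing if they differ:

library(dplyr)

# convert the bound rows to a tibble
posts <- tibble::as_tibble(items)
glimpse(posts)  # check which fields the API actually returns

# "Title" and "Date" are assumed field names; any_of() silently
# skips any that are missing instead of erroring
posts %>% select(any_of(c("Title", "Date")))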