Reputation: 11
I'm new to R and need to scrape the titles and the dates on the posts on this website https://www.healthnewsreview.org/news-release-reviews/
Using rvest I was able to write the basic code to get the info:
library(rvest)
url <- 'https://www.healthnewsreview.org/?post_type=news-release-review&s='
webpage <- read_html(url)
# dates
date_data_html <- html_nodes(webpage,'span.date')
date_data <- html_text(date_data_html)
head(date_data)
# titles (the page is already parsed above, so no second read_html() is needed)
title_data_html <- html_nodes(webpage,'h2')
title_data <- html_text(title_data_html)
head(title_data)
But the website only displays 10 items at first, and then you have to click "View more" to load the rest, so I don't know how to scrape the whole site. Thank you!!
Upvotes: 1
Views: 146
Reputation: 78792
Introducing third-party dependencies should be done as a last resort. RSelenium (as r2evans posited as the only solution, originally) is not necessary the vast majority of the time, including now. (It is necessary for gosh-awful sites that use horrible tech like SharePoint, since maintaining state without a browser context for that is more pain than it's worth.)
If we start with the main page:
library(rvest)
pg <- read_html("https://www.healthnewsreview.org/news-release-reviews/")
We can get the first set of links (10 of them):
pg %>%
  html_nodes("div.item-content") %>%
  html_attr("onclick") %>%
  gsub("^window.location.href='|'$", "", .)
## [1] "https://www.healthnewsreview.org/news-release-review/more-unwarranted-hype-over-the-unique-benefits-of-proton-therapy-this-time-in-combo-with-thermal-therapy/"
## [2] "https://www.healthnewsreview.org/news-release-review/caveats-and-outside-expert-balance-speculative-claim-that-anti-inflammatory-diet-might-benefit-bipolar-disorder-patients/"
## [3] "https://www.healthnewsreview.org/news-release-review/plug-for-study-of-midwifery-for-low-income-women-is-fuzzy-on-benefits-costs/"
## [4] "https://www.healthnewsreview.org/news-release-review/tiny-safety-trial-prematurely-touts-clinical-benefit-of-cancer-vaccine-for-her2-positive-cancers/"
## [5] "https://www.healthnewsreview.org/news-release-review/claim-that-milk-protein-alleviates-chemotherapy-side-effects-based-on-study-of-just-12-people/"
## [6] "https://www.healthnewsreview.org/news-release-review/observational-study-cant-prove-surgery-better-than-more-conservative-prostate-cancer-treatment/"
## [7] "https://www.healthnewsreview.org/news-release-review/recap-of-mental-imagery-for-weight-loss-study-requires-that-readers-fill-in-the-blanks/"
## [8] "https://www.healthnewsreview.org/news-release-review/bmjs-attempt-to-hook-readers-on-benefits-of-golf-slices-way-out-of-bounds/"
## [9] "https://www.healthnewsreview.org/news-release-review/time-to-test-all-infants-gut-microbiomes-or-is-this-a-product-in-search-of-a-condition/"
## [10] "https://www.healthnewsreview.org/news-release-review/zika-vaccine-for-brain-cancer-pr-release-headline-omits-crucial-words-in-mice/"
I guess you want to scrape the content of those ^^ so have at it.
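If you want to see what that onclick clean-up does without hitting the site, here is a small offline sketch. The HTML fragment and the example.com URLs are made up for illustration; only the div.item-content selector and the gsub() pattern come from the code above.

```r
library(rvest)

# Made-up fragment that imitates the structure of the listing page
# (assumption: each item is a div.item-content with an onclick redirect).
fake_html <- paste0(
  "<html><body>",
  "<div class=\"item-content\" onclick=\"window.location.href='https://example.com/review-a/'\"></div>",
  "<div class=\"item-content\" onclick=\"window.location.href='https://example.com/review-b/'\"></div>",
  "</body></html>"
)

links <- read_html(fake_html) %>%
  html_nodes("div.item-content") %>%            # one node per listed item
  html_attr("onclick") %>%                      # raw "window.location.href='...'" strings
  gsub("^window.location.href='|'$", "", .)     # strip the JS wrapper, keep the URL

print(links)
```

The same pipeline works unchanged on the real pg object, since only the input document differs.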
But, there's that pesky "View more" button.
When you click on it, it issues a POST request to the site's admin-ajax.php endpoint.
With curlconverter we can convert it into a callable httr function (which may not exist given the impossibility of this task). We can wrap that function call in another function with a pagination parameter:
view_more <- function(current_offset=10) {
  # mimic the XHR POST that the "View more" button issues
  httr::POST(
    url = "https://www.healthnewsreview.org/wp-admin/admin-ajax.php",
    httr::add_headers(
      `X-Requested-With` = "XMLHttpRequest"
    ),
    body = list(
      action = "viewMore",
      current_offset = as.character(as.integer(current_offset)),
      page_id = "22332",
      btn = "btn btn-gray",
      active_filter = "latest"
    ),
    encode = "form"
  ) -> res
  # return the links from this batch plus the offset for the next call
  list(
    links = httr::content(res) %>%
      html_nodes("div.item-content") %>%
      html_attr("onclick") %>%
      gsub("^window.location.href='|'$", "", .),
    next_offset = current_offset + 4
  )
}
Now, we can run it (since it defaults to the 10 issued in the first View More click):
x <- view_more()
str(x)
## List of 2
## $ links : chr [1:4] "https://www.healthnewsreview.org/news-release-review/university-pr-misleads-with-claim-that-preliminary-blood-t"| __truncated__ "https://www.healthnewsreview.org/news-release-review/observational-study-on-testosterone-replacement-therapy-fo"| __truncated__ "https://www.healthnewsreview.org/news-release-review/recap-of-lung-cancer-screening-test-relies-on-hyperbole-co"| __truncated__ "https://www.healthnewsreview.org/news-release-review/ties-to-drugmaker-left-out-of-postpartum-depression-drug-study-recap/"
## $ next_offset: num 14
We can pass that new offset to another call:
y <- view_more(x$next_offset)
str(y)
## List of 2
## $ links : chr [1:4] "https://www.healthnewsreview.org/news-release-review/sweeping-claims-based-on-a-single-case-study-of-advanced-c"| __truncated__ "https://www.healthnewsreview.org/news-release-review/false-claims-of-benefit-weaken-news-release-on-experimenta"| __truncated__ "https://www.healthnewsreview.org/news-release-review/contrary-to-claims-heart-scans-dont-save-lives-but-subsequ"| __truncated__ "https://www.healthnewsreview.org/news-release-review/breastfeeding-for-stroke-prevention-kudos-to-heart-associa"| __truncated__
## $ next_offset: num 18
You can do the hard part of scraping the initial article count (it's on the main page) and doing the math to put that in a loop and stop efficiently.
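For the loop itself, here is one rough sketch. Note that it swaps in a simpler stop rule than the article-count math: it just keeps going until a batch comes back empty. That relies on the untested assumption that the endpoint returns no items once you run past the end. collect_links() and fake_view_more() are hypothetical names, and the fake function only stands in for view_more() so the example runs offline.

```r
# Hypothetical helper: repeatedly call a view_more()-style function,
# feeding each call the next_offset returned by the previous one.
collect_links <- function(fetch_page, start_offset = 10, max_pages = 500) {
  all_links <- character()
  offset <- start_offset
  for (i in seq_len(max_pages)) {        # max_pages is a safety cap against endless loops
    res <- fetch_page(offset)
    if (length(res$links) == 0) break    # assumption: an empty batch means no more items
    all_links <- c(all_links, res$links)
    offset <- res$next_offset
    # Sys.sleep(1)                       # uncomment to pause between real requests
  }
  unique(all_links)
}

# Offline stand-in for view_more(): a pretend site with 20 items, served 4 at a time.
fake_site <- sprintf("https://example.com/review-%02d/", 1:20)
fake_view_more <- function(current_offset = 10) {
  idx <- seq(current_offset + 1, length.out = 4)
  idx <- idx[idx <= length(fake_site)]   # drop positions past the end of the pretend site
  list(links = fake_site[idx], next_offset = current_offset + 4)
}

rest <- collect_links(fake_view_more)
print(length(rest))
```

Against the real site you would call collect_links(view_more) and prepend the 10 links scraped from the main page.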
NOTE: If you are doing this scraping to archive the complete site (whether for them or independently) since it's dying at the end of the year, you should comment to that effect and I have better suggestions for that use-case than manual coding in any programming language. There are free, industrial "site preservation" frameworks designed to preserve these types of dying resources. If you just need the article content, then an iterator and custom scraper is likely a 👍🏼 (but, apparently impossible) choice.
NOTE also that the pagination increment of 4 is what the site does when you literally press the button, so this just mimics that functionality.
Upvotes: 4