Dominic Comtois

Reputation: 10411

Extract text from dynamic Web page using R

I am working on a data prep tutorial, using data from this article: https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#

None of the text is hard-coded; everything is generated dynamically, and I don't know where to start. I've tried a few things with the rvest and xml2 packages, but I can't even tell whether I'm making progress or not.
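
For illustration, the kind of static fetch I tried looks like this (a sketch; it returns the page shell, but almost none of the visible text):

library(rvest)

url <- "https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#"
page <- read_html(url)

#The static HTML contains almost none of the rendered insult text
page %>% html_text() %>% nchar()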

I've used copy/paste and regexes in Notepad++ to get a tabular structure like this:

Target      Attack
AAA News    Fake News
AAA News    Fake News
AAA News    A total disgrace
...         ...
Mr. ZZZ     A real nut job

but I'd like to show how to do everything programmatically (no copy/paste).

My main question is as follows: is that even possible with reasonable effort? And if so, any clues on how to get started?

PS: I know this could be a duplicate, I just can't tell which question it duplicates, since there are totally different approaches out there :\

Upvotes: 1

Views: 419

Answers (2)

Ian Campbell

Reputation: 24838

Here's a programmatic approach with RSelenium and rvest:

library(RSelenium)
library(rvest)
library(tidyverse)

#Start a Selenium server and browser; chromever must match your installed Chrome
driver <- rsDriver(browser = "chrome", port = 4234L, chromever = "87.0.4280.87")
client <- driver[["client"]]

#Navigate to the page and grab the fully rendered source
client$navigate("https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#")
page.source <- client$getPageSource()[[1]]

#Extract nodes for each letter using XPath
Letters <- read_html(page.source) %>%
  html_nodes(xpath = '//*[@id="mem-wall"]/div[2]/div') 

#Extract Entities using CSS
Entities <- map(Letters, ~ html_nodes(.x, css = 'div.g-entity-name') %>%
                  html_text)

#Extract quotes using CSS
Quotes <- map(Letters, ~ html_nodes(.x, css = 'div.g-twitter-quote-container') %>%
                            map(html_nodes, css = 'div.g-twitter-quote-c') %>%
                            map(html_text))

#Bind the entities and quotes together. Two letters are blank, so fall back to NA
map2_dfr(Entities, Quotes,
         ~ map2_dfr(.x, .y, ~ {if(length(.x) > 0 & length(.y) > 0){data.frame(Entity = .x, Insult = .y)}else{
                                                        data.frame(Entity = NA, Insult = NA)}})) -> Result

#Strip out the quotes
Result %>%
  mutate(Insult = str_replace_all(Insult,"(^“)|([ .,!?]?”)","") %>% str_trim) -> Result

#Take a look at the result
Result %>%
  slice_sample(n=10)
                   Entity                                                              Insult
1             Mitt Romney                                       failed presidential candidate
2         Hillary Clinton                                                             Crooked
3  The “mainstream” media                                                           Fake News
4               Democrats                                             on a fishing expedition
5           Pete Ricketts                                             illegal late night coup
6  The “mainstream” media                                                   anti-Trump haters
7     The Washington Post do nothing but write bad stories even on very positive achievements
8               Democrats                                                                weak
9             Marco Rubio                                                         Lightweight
10     The Steele Dossier                                                      a Fake Dossier
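
When finished, close the browser and stop the Selenium server (a cleanup step using the driver and client objects created above):

#Close the browser and stop the Selenium server
client$close()
driver$server$stop()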

The XPath was obtained by inspecting the web page source (F12 in Chrome), hovering over elements until the correct one was highlighted, right-clicking, and choosing Copy XPath, as shown below:

[Screenshot: Chrome developer tools with the Copy XPath context-menu option highlighted]

Upvotes: 1

Dave2e

Reputation: 24089

I've used up my free article allocation at The NY Times for the month, but here is some guidance. It looks like the web page uses several scripts to create and display the page.

If you use your browser's developer tools and look at the Network tab, you will find two CSV files.

It looks like the reduced file creates the table quoted above, and the tweets-full file contains the full tweets. You can download these files directly with read.csv() and then process the information as needed.
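
For example, a sketch of the direct download (the URL below is a placeholder; copy the real CSV address from the Network tab):

#Placeholder URL; paste the actual CSV address from the Network tab
csv_url <- "https://static01.nyt.com/newsgraphics/.../tweets-reduced.csv"
insults <- read.csv(csv_url, stringsAsFactors = FALSE)
head(insults)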

Be sure to read the terms of service before scraping any web page.
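
One programmatic way to check the site's crawling rules (a sketch, assuming the robotstxt package; note that robots.txt is separate from the terms of service):

library(robotstxt)

#Returns TRUE if robots.txt permits fetching this path
paths_allowed("https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html")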

Upvotes: 2
