Dominic Comtois

Reputation: 10411

Extract text from dynamic Web page using R

I am working on a data prep tutorial, using data from this article: https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#

None of the text is hard-coded; everything is generated dynamically, and I don't know where to start. I've tried a few things with the rvest and xml2 packages, but I can't even tell whether I'm making progress or not.
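
For illustration, the kind of static fetch I tried looks like this (a sketch; it returns the page shell, but almost none of the visible text):

library(rvest)

url <- "https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#"
page <- read_html(url)

#The static HTML contains almost none of the rendered insult text
page %>% html_text() %>% nchar()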

I've used copy/paste and regexes in Notepad++ to get a tabular structure like this:

Target      Attack
AAA News    Fake News
AAA News    Fake News
AAA News    A total disgrace
...         ...
Mr. ZZZ     A real nut job

but I'd like to show how to do everything programmatically (no copy/paste).

My main question is as follows: is that even possible with reasonable effort? And if so, any clues on how to get started?

PS: I know this could be a duplicate, I just can't tell which question it duplicates, since there are totally different approaches out there :\

Upvotes: 1

Views: 419

Answers (2)

Ian Campbell

Reputation: 24838

Here's a programmatic approach with RSelenium and rvest:

library(RSelenium)
library(rvest)
library(tidyverse)

#Start a Selenium server and browser; chromever must match your installed Chrome
driver <- rsDriver(browser = "chrome", port = 4234L, chromever = "87.0.4280.87")
client <- driver[["client"]]

#Navigate to the page and grab the fully rendered source
client$navigate("https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#")
page.source <- client$getPageSource()[[1]]

#Extract nodes for each letter using XPath
Letters <- read_html(page.source) %>%
  html_nodes(xpath = '//*[@id="mem-wall"]/div[2]/div') 

#Extract Entities using CSS
Entities <- map(Letters, ~ html_nodes(.x, css = 'div.g-entity-name') %>%
                  html_text)

#Extract quotes using CSS
Quotes <- map(Letters, ~ html_nodes(.x, css = 'div.g-twitter-quote-container') %>%
                            map(html_nodes, css = 'div.g-twitter-quote-c') %>%
                            map(html_text))

#Bind the entities and quotes together. Two letters are blank, so fall back to NA
map2_dfr(Entities, Quotes,
         ~ map2_dfr(.x, .y, ~ {if(length(.x) > 0 & length(.y) > 0){data.frame(Entity = .x, Insult = .y)}else{
                                                        data.frame(Entity = NA, Insult = NA)}})) -> Result

#Strip out the quotes
Result %>%
  mutate(Insult = str_replace_all(Insult,"(^“)|([ .,!?]?”)","") %>% str_trim) -> Result

#Take a look at the result
Result %>%
  slice_sample(n=10)
                   Entity                                                              Insult
1             Mitt Romney                                       failed presidential candidate
2         Hillary Clinton                                                             Crooked
3  The “mainstream” media                                                           Fake News
4               Democrats                                             on a fishing expedition
5           Pete Ricketts                                             illegal late night coup
6  The “mainstream” media                                                   anti-Trump haters
7     The Washington Post do nothing but write bad stories even on very positive achievements
8               Democrats                                                                weak
9             Marco Rubio                                                         Lightweight
10     The Steele Dossier                                                      a Fake Dossier
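
When finished, close the browser and stop the Selenium server (a cleanup step using the driver and client objects created above):

#Close the browser and stop the Selenium server
client$close()
driver$server$stop()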

The XPath was obtained by inspecting the web page source (F12 in Chrome), hovering over elements until the correct one was highlighted, right-clicking, and choosing Copy XPath, as shown below:

[Screenshot: Chrome developer tools with the Copy XPath context-menu option highlighted]

Upvotes: 1

Dave2e

Reputation: 24089

I've used up my free article allocation at The NY Times for the month, but here is some guidance. It looks like the web page uses several scripts to create and display the page.

If you use your browser's developer tools and look at the Network tab, you will find two CSV files.

It looks like the reduced file creates the table quoted above, and the tweets-full file contains the full tweets. You can download these files directly with read.csv() and then process the information as needed.
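
For example, a sketch of the direct download (the URL below is a placeholder; copy the real CSV address from the Network tab):

#Placeholder URL; paste the actual CSV address from the Network tab
csv_url <- "https://static01.nyt.com/newsgraphics/.../tweets-reduced.csv"
insults <- read.csv(csv_url, stringsAsFactors = FALSE)
head(insults)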

Be sure to read the terms of service before scraping any web page.
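
One programmatic way to check the site's crawling rules (a sketch, assuming the robotstxt package; note that robots.txt is separate from the terms of service):

library(robotstxt)

#Returns TRUE if robots.txt permits fetching this path
paths_allowed("https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html")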

Upvotes: 2
