Kim

Reputation: 4308

Xpath found with Elements but not readable/scrapeable via rvest

I am trying to scrape the dollar amounts listed on a set of donation websites. So in this example, I would like to get

$3, $10, $25, $100, $250, $1500, $2800

The xpath indicates that one of them should be

/html/body/div[1]/div[3]/div[2]/div/div[1]/div/div/form/div/div[1]/div/div/ul/li[2]/label

and the css selector

li.btn--wrapper:nth-child(2) > label:nth-child(1)

Up to the following, I see something in the xml_nodeset:

library(rvest)
url <- "https://secure.actblue.com/donate/pete-buttigieg-announcement-day"
read_html(url) %>% html_nodes(
  xpath = '//*[@id="cf-app-target"]/div[3]/div[2]/div/div[1]/div/div'
)

Then I add the second part of the xpath and the result comes back blank. The same happens with

X %>% html_nodes("li")

which gives a bunch of things, but all the StyledButton__StyledAnchorButton-a7s38j-0 kEcVlT turn blank.

I have worked with rvest for a fair bit now, but this one's baffling. And I am not quite sure how RSelenium would help here, although I know how to use it for screenshots and clicks. If it helps, the website also refuses to be captured in the Wayback Machine: there's only the background and nothing else.

I have even tried just taking a screenshot with RSelenium and attempting OCR with tesseract and magick, but while other pages worked, this particular example fails spectacularly, because the text is in white and in a rather nonstandard font. Yes, I've tried image_negate and image_resize to see if they helped, but that only showed that relying on OCR is a rather bad idea, as it depends on screenshot size.

Any advice on how to best extract what I want in this situation? Thanks.

Upvotes: 2

Views: 149

Answers (1)

QHarr

Reputation: 84465

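You can use a regex to extract the numbers from the script tag that contains preloadedState, where the amounts are stored. The result is a character vector holding the amounts as a comma-separated string.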

library(rvest)
library(stringr)

# Open the page as a binary connection and parse it
con <- url('https://secure.actblue.com/donate/pete-buttigieg-announcement-day?refcode=website', "rb")
page <- read_html(con)

# Find the script tag holding preloadedState, then match the digits after "amounts":[
res <- page %>%
  html_nodes(xpath = ".//script[contains(., 'preloadedState')]") %>%
  html_text() %>%
  str_match_all('(?<="amounts":\\[)(\\d+,?)+')
print(res[[1]][, 1])
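
If a numeric vector is more convenient than a comma-separated string, the match can be split and converted. This is only a minimal sketch: `script_text` is a hypothetical stand-in for the script tag's text, and the real page's JSON may be laid out differently.

```r
library(stringr)

# Hypothetical sample of the script text (an assumption, not copied from the live page)
script_text <- 'window.preloadedState = {"amounts":[3,10,25,100,250,1500,2800],"other":1}'

# Take the run of digits and commas that follows "amounts":[
amounts_chr <- str_extract(script_text, '(?<="amounts":\\[)[\\d,]+')

# Split on commas and convert to numbers
amounts <- as.numeric(str_split(amounts_chr, ",")[[1]])

# Prefix each value with a dollar sign to mirror the labels on the page
labels <- paste0("$", amounts)
print(labels)
```

The same two steps (split, then `as.numeric`) should apply to `res[[1]][, 1]` from the code above, provided the live page still stores the amounts in that form.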


Upvotes: 3
