Scrape first class node but not child using rvest

Question

There are many questions on this, but I couldn't find the answer I'm looking for.

I'm trying to extract the text of the element with class .quoteText. My code does that, but it also extracts the text of all the child nodes within .quoteText:

library(rvest)    # read_html(), html_nodes(), html_text(), %>%
library(stringr)  # str_trim()

url <- "https://www.goodreads.com/quotes/search?page=1&q=simone+de+beauvoir&utf8=%E2%9C%93"

quote_text <- function(html){

  path <- read_html(html)

  path %>% 
    html_nodes(".quoteText") %>%
    html_text(trim = TRUE) %>% 
    str_trim(side = "both") %>% 
    unlist()
}

quote_text(url)

The result contains the quote text, but also the text of every child node!

This is what the inspector tool brings up. What I'm looking for is the highlighted line, but not the sub-lines under the same code.

There must be a way to scrape only that line, no? Or will I need to collect that line, and remove the rest with a str_extract / regex?

MrFlick · Accepted Answer

CSS selectors don't appear to support selecting just the immediate text of a node, but XPath does. We can adjust your function to extract only that text:

quote_text <- function(html){

  path <- read_html(html)

  path %>% 
    # translate the CSS selector to XPath, then keep only its direct text nodes
    html_nodes(xpath = paste0(selectr::css_to_xpath(".quoteText"), "/text()")) %>%
    html_text(trim = TRUE) %>% 
    str_trim(side = "both") %>% 
    unlist()
}

I convert the CSS selector to an XPath expression and then append "/text()", which selects only the text nodes that are direct children of the matched elements and skips the text inside their child elements.
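
To see the difference without hitting the live site, here is a minimal sketch on a small inline fragment. The HTML below is made up for illustration (only the .quoteText class comes from the question; the authorOrTitle span is a hypothetical child), and the variable names are placeholders. Whitespace between child elements also shows up as text nodes, so the sketch drops the empty strings left after trimming.

```r
library(rvest)  # read_html(), html_nodes(), html_text(), %>%

# Hypothetical fragment: a .quoteText element with one child element
html <- '<div class="quoteText">
  One is not born, but rather becomes, a woman.
  <span class="authorOrTitle">Simone de Beauvoir</span>
</div>'

page <- read_html(html)

# CSS selector: text of the node plus the text of all its descendants
all_text <- page %>%
  html_nodes(".quoteText") %>%
  html_text(trim = TRUE)

# XPath: only the text nodes that are direct children of .quoteText
xp <- paste0(selectr::css_to_xpath(".quoteText"), "/text()")
own_text <- page %>%
  html_nodes(xpath = xp) %>%
  html_text(trim = TRUE)

# Whitespace-only text nodes become "" after trimming, so filter them out
quote_only <- own_text[own_text != ""]

print(all_text)
print(quote_only)
```

On a real page the element may contain several non-empty text nodes (a separator character after a line break, for example), so it may still be necessary to pick the first element or filter further.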
