Corey Pembleton

Reputation: 757

Scrape first class node but not child using rvest

There are many questions on this topic, but I couldn't find the answer I'm looking for.

I'm looking to extract specific text with the class .quoteText. My code works, but it also extracts the text of all the child nodes within .quoteText:

library(rvest)    # read_html(), html_nodes(), html_text(), %>%
library(stringr)  # str_trim()

url <- "https://www.goodreads.com/quotes/search?page=1&q=simone+de+beauvoir&utf8=%E2%9C%93"

quote_text <- function(html){

  path <- read_html(html)

  path %>% 
    html_nodes(".quoteText") %>%
    html_text(trim = TRUE) %>% 
    str_trim(side = "both") %>% 
    unlist()
}

quote_text(url)

The result contains the quote text, but also the text of every child node!

[Screenshot: browser inspector view of the .quoteText node]

This is what the inspector tool brings up. What I'm looking for is the highlighted line, but not the child lines nested under it.

There must be a way to scrape only that line, no? Or will I need to collect that line, and remove the rest with a str_extract / regex?


Upvotes: 1

Views: 1307

Answers (1)

MrFlick

Reputation: 206401

It doesn't look like CSS selectors support getting just the immediate text of the selected node, but XPath does. We can adjust your function to extract only that text:

quote_text <- function(html){

  path <- read_html(html)

  path %>% 
    html_nodes(xpath = paste0(selectr::css_to_xpath(".quoteText"), "/text()")) %>%
    html_text(trim = TRUE) %>% 
    str_trim(side = "both") %>% 
    unlist()
}

I convert the CSS selector to an XPath expression and then append "/text()" to get only the text nodes that are direct children of the matched elements.
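
To see the difference without hitting the live site, here is a minimal sketch on an inline HTML snippet. The markup, the quote text, and the author name are made up for illustration and only loosely imitate the structure in the question; the real page may nest things differently.

```r
library(rvest)  # re-exports read_html() and %>%

# Hypothetical markup: a .quoteText node with its own text plus a child <span>
html <- '<div class="quoteText">Quote body<br><span class="author">Some Author</span></div>'
page <- read_html(html)

# CSS selector only: html_text() returns the node's text including all descendants
all_text <- page %>%
  html_nodes(".quoteText") %>%
  html_text(trim = TRUE)

# XPath with /text(): only the text nodes that are direct children of .quoteText
own_xpath <- paste0(selectr::css_to_xpath(".quoteText"), "/text()")
own_text <- page %>%
  html_nodes(xpath = own_xpath) %>%
  html_text(trim = TRUE)

print(all_text)
print(own_text)
```

On a real page there may be several direct text nodes per element (for example whitespace between tags), so you might need to drop empty strings from the result afterwards.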

Upvotes: 2
