Reputation: 757
There are many questions on this topic, but I couldn't find the answer I'm looking for.
I want to extract specific text from elements with the class .quoteText.
My code does that, but it also extracts the text of all the child nodes within .quoteText:
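library(rvest)    # read_html(), html_nodes(), html_text(), %>%
library(stringr)  # str_trim()

url <- "https://www.goodreads.com/quotes/search?page=1&q=simone+de+beauvoir&utf8=%E2%9C%93"

# Return the text of every .quoteText element (child nodes included)
quote_text <- function(html){
  path <- read_html(html)
  path %>%
    html_nodes(".quoteText") %>%
    html_text(trim = TRUE) %>%
    str_trim(side = "both") %>%
    unlist()
}

quote_text(url)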
The result contains the quote text, but also the text of every child node!
This is what the inspector tool brings up. What I'm looking for is the highlighted line, but not the nested lines beneath it.
There must be a way to scrape only that line, no? Or will I need to collect everything and remove the rest with str_extract / regex?
Upvotes: 1
Views: 1307
Reputation: 206401
It doesn't look like CSS selectors support getting just the immediate text of the selected node, but XPath does. We can adjust your function to extract only that text:
quote_text <- function(html){
  path <- read_html(html)
  path %>%
    # Convert the CSS selector to XPath, then select only its direct text nodes
    html_nodes(xpath = paste0(selectr::css_to_xpath(".quoteText"), "/text()")) %>%
    html_text(trim = TRUE) %>%
    str_trim(side = "both") %>%
    unlist()
}
I convert the CSS selector to an XPath expression and then append "/text()" so that only the text nodes that are direct children of the matched elements are returned.
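To see the effect without hitting the site, here is a minimal sketch on an inline HTML snippet. The markup is a made-up stand-in for one quote block (the real Goodreads page may be structured differently), and the names `html`, `page` and `own_text` are only for illustration. It also drops empty strings, on the assumption that whitespace-only text nodes can show up between child elements.

```r
library(rvest)

# Hypothetical markup standing in for a single quote block (an assumption,
# not copied from the real page).
html <- '<div class="quoteText">
  One is not born, but rather becomes, a woman.
  <br>
  <span class="authorOrTitle">Simone de Beauvoir</span>
</div>'

page <- read_html(html)

# Same idea as above: CSS selector -> XPath, then only direct text-node children.
own_text <- page %>%
  html_nodes(xpath = paste0(selectr::css_to_xpath(".quoteText"), "/text()")) %>%
  html_text(trim = TRUE)

# Whitespace-only text nodes become "" after trimming, so filter them out.
own_text <- own_text[nzchar(own_text)]

print(own_text)
```

The author name inside the span is not returned, because "/text()" only matches text nodes that sit directly under the .quoteText element.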
Upvotes: 2