How to extract text only from parent HTML node (excluding child node)?

Question

I have HTML like this (the child element is shown here as a span):

    <p>
        <span>(22)</span>
        where?
    </p>

I am using this code to extract text:

html_nodes(messageNode, xpath=".//p") %>% html_text() %>% paste0(collapse="\n")

And getting the result:

"(22) where?"

But I need only the text of "p" itself, excluding any text inside its child nodes. I want to get this text:

"where"

Is there any way to exclude child nodes while I am getting the text?

Mac OS 10.11.6 (15G31), RStudio version 0.99.903, R version 3.3.1 (2016-06-21)

hrbrmstr · Accepted Answer

This will grab the text from the direct text-node children of each "p" (which means it won't include text from sub-nodes that aren't text nodes themselves):

library(xml2)
library(rvest)
library(purrr)

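# Sample input; the tag names (root, span, b, i) are illustrative.
txt <- '<root>
    <p>
        <span>(22)</span>
        where?
    </p>
  <p>
    stays 
    <b>disappears</b>
    <i>disappears</i>
    <span>disappears</span>
    stays
  </p>
</root>'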

doc <- read_xml(txt)

html_nodes(doc, xpath="//p") %>% 
  map_chr(~paste0(html_text(html_nodes(., xpath="./text()"), trim=TRUE), collapse=" "))
## [1] "where?"     "stays stays"

Unfortunately, that's pretty "lossy" (you lose everything inside child tags) but this or @Floo0's (also potentially lossy) solution may work sufficiently for you.

If you use the XML package you can actually edit nodes (i.e. delete node elements).
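As a rough sketch of that "delete the child nodes first" idea: the example below does not use the XML package but swaps in xml2's own `xml_remove()`, since xml2 is already loaded above. The sample markup and its tag names (`root`, `span`, `b`, `i`) are assumptions made for illustration, not the asker's real page.

```r
library(xml2)

# Sample markup; the tag names (root, span, b, i) are assumptions.
txt <- '<root>
  <p>
    <span>(22)</span>
    where?
  </p>
  <p>
    stays
    <b>disappears</b>
    <i>disappears</i>
    stays
  </p>
</root>'

doc <- read_xml(txt)

# Remove every element that is a direct child of a <p>.
# Note that this modifies doc in place.
xml_remove(xml_find_all(doc, "//p/*"))

# What is left in each <p> is its own text; trim it and
# collapse the leftover runs of whitespace into single spaces.
own_text <- gsub("\\s+", " ", trimws(xml_text(xml_find_all(doc, "//p"))))
own_text
```

On this sample, `own_text` should hold "where?" and "stays stays". Because the document is edited in place, re-read it (or work on a copy) if you still need the child nodes afterwards.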
