Ildar Gabdrakhmanov

Reputation: 185

How to extract text only from parent HTML node (excluding child node)?

I have this HTML:

<div class="activityBody postBody thing">
    <p>
        <a href="/forum/conversation/post/3904-22" rel="post" data-id="3904-22" class="mqPostRef">(22)</a>
        where?
    </p>
</div>

I am using this code to extract text:

html_nodes(messageNode, xpath=".//p") %>% html_text() %>% paste0(collapse="\n")

This is the result I get:

"(22) where?"

But I need only the text that belongs directly to "p", excluding any text inside child nodes of "p". This is the text I want:

"where?"

Is there any way to exclude child nodes when extracting the text?

Mac OS 10.11.6 (15G31), RStudio Version 0.99.903, R version 3.3.1 (2016-06-21)

Upvotes: 2

Views: 2208

Answers (2)

hrbrmstr

Reputation: 78792

This grabs the text nodes that are direct children of each <p>, which means it won't include text from child elements such as <a>, <b>, or <span>:

library(xml2)
library(rvest)
library(purrr)

txt <- '<div class="activityBody postBody thing">
    <p>
        <a href="/forum/conversation/post/3904-22" rel="post" data-id="3904-22" class="mqPostRef">(22)</a>
        where?
    </p>
  <p>
    stays 
    <b>disappears</b>
    <a>disappears</a>
    <span>disappears</span>
    stays
  </p>
</div>'

doc <- read_xml(txt)

html_nodes(doc, xpath="//p") %>% 
  map_chr(~paste0(html_text(html_nodes(., xpath="./text()"), trim=TRUE), collapse=" "))
## [1] "where?"     "stays stays"

Unfortunately, that's pretty "lossy" (you lose the contents of <b>, <span>, etc.), but this or @Floo0's (also potentially lossy) solution may work well enough for you.

If you use the XML package you can actually edit nodes (i.e. delete node elements).
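
As a rough sketch of that XML-package route (my reading of the suggestion, not code from the original answer): parse the markup, remove every element child of each <p>, then read what is left. The names txt, doc and p_text, and the whitespace clean-up with gsub(), are illustrative choices.

```r
library(XML)

txt <- '<div class="activityBody postBody thing">
    <p>
        <a href="/forum/conversation/post/3904-22" rel="post" data-id="3904-22" class="mqPostRef">(22)</a>
        where?
    </p>
  <p>
    stays
    <b>disappears</b>
    <span>disappears</span>
    stays
  </p>
</div>'

# Parse the fragment as HTML (asText = TRUE: txt is markup, not a file name)
doc <- htmlParse(txt, asText = TRUE)

# Delete every element that is a direct child of a <p>; this edits doc in place
removeNodes(getNodeSet(doc, "//p/*"))

# Only the <p> elements' own text is left; collapse whitespace runs and trim
p_text <- xpathSApply(doc, "//p", xmlValue)
p_text <- trimws(gsub("\\s+", " ", p_text))
p_text
```

Note that this modifies the parsed document, so re-parse the source if the removed nodes are needed later.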

Upvotes: 1

Rentrop

Reputation: 21497

If you are sure the text you want always comes last, you can use:

doc %>% html_nodes(xpath=".//p/text()[last()]") %>% xml_text(trim = TRUE)

Alternatively, you can use the following to select all non-empty strings:

doc %>% html_nodes(xpath=".//p/text()[normalize-space()]") %>% xml_text(trim = TRUE)

For more details on normalize-space() see https://developer.mozilla.org/en-US/docs/Web/XPath/Functions/normalize-space

A third option is to use the xml2 package directly:

doc %>% xml2::xml_find_chr(xpath="normalize-space(.//p/text())")
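
One caveat with the call above: in XPath 1.0, normalize-space() applied to a node set uses only the first node in document order, so the result is a single string. If one string per <p> is wanted, a minimal xml2-only sketch could look like the following (the names paragraphs and own_text are illustrative, and the sample markup is shortened):

```r
library(xml2)

txt <- '<div>
  <p>
    <a href="/forum/conversation/post/3904-22">(22)</a>
    where?
  </p>
  <p>
    stays
    <b>disappears</b>
    stays
  </p>
</div>'

doc <- read_xml(txt)
paragraphs <- xml_find_all(doc, ".//p")

# For each <p>, keep only its non-blank direct text nodes and join them
own_text <- vapply(paragraphs, function(p) {
  pieces <- xml_text(xml_find_all(p, "./text()[normalize-space()]"), trim = TRUE)
  paste(pieces, collapse = " ")
}, character(1))
own_text
```

Querying each <p> separately keeps the result aligned with the paragraphs, which a single document-wide XPath over text nodes does not guarantee.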

Upvotes: 2
