Reputation: 185
I have a code:
<div class="activityBody postBody thing">
<p>
<a href="/forum/conversation/post/3904-22" rel="post" data-id="3904-22" class="mqPostRef">(22)</a>
where?
</p>
</div>
I am using this code to extract text:
html_nodes(messageNode, xpath=".//p") %>% html_text() %>% paste0(collapse="\n")
And getting the result:
"(22) where?"
But I need only "p" text, excluding text that could be inside "p" in child nodes. I have to get this text:
"where"
Is there any way to exclude child nodes while I getting text?
Mac OS 10.11.6 (15G31), RSrudio Version 0.99.903, R version 3.3.1 (2016-06-21)
Upvotes: 2
Views: 2208
Reputation: 78792
This will grab all the text from <p>
children (which means it won't include text from sub-nodes that aren't "text emitters":
library(xml2)
library(rvest)
library(purrr)
txt <- '<div class="activityBody postBody thing">
<p>
<a href="/forum/conversation/post/3904-22" rel="post" data-id="3904-22" class="mqPostRef">(22)</a>
where?
</p>
<p>
stays
<b>disappears</b>
<a>disappears</a>
<span>disappears</span>
stays
</p>
</div>'
doc <- read_xml(txt)
html_nodes(doc, xpath="//p") %>%
map_chr(~paste0(html_text(html_nodes(., xpath="./text()"), trim=TRUE), collapse=" "))
## [1] "where?" "stays stays"
Unfortunately, that's pretty "lossy" (you lose <b>
, <span>
, etc) but this or @Floo0's (also potentially lossy) solution may work sufficiently for you.
If you use the XML
package you can actually edit nodes (i.e. delete node elements).
Upvotes: 1
Reputation: 21497
If you are sure the text you want always comes last you can use:
doc %>% html_nodes(xpath=".//p/text()[last()]") %>% xml_text(trim = TRUE)
Alternatively you can use the following to select all "non empty" trings
doc %>% html_nodes(xpath=".//p/text()[normalize-space()]") %>% xml_text(trim = TRUE)
For more details on normalize-space()
see https://developer.mozilla.org/en-US/docs/Web/XPath/Functions/normalize-space
3rd option would be to use the xml2
package directly via:
doc %>% xml2::xml_find_chr(xpath="normalize-space(.//p/text())")
Upvotes: 2