SOConnell

Reputation: 793

R and xpathApply -- removing duplicates from nested html tags

I have edited the question for brevity and clarity

My goal is to find an XPath expression that returns "test1" through "test8" as separate items.

I am working with xpathApply to extract text from web pages. Because the pages I pull from have different layouts, I need to extract the text values from all <font> and <p> html tags. The problem arises when one type is nested within the other: I get partial duplicates when I use the following xpathApply expression with an "or" (union) condition.

library(XML)
html <- 
  '<!DOCTYPE html>
  <html lang="en">
    <body>
      <p>test1</p>
      <font>test2</font>
      <p><font>test3</font></p>
      <font><p>test4</p></font>
      <p>test5<font>test6</font></p>    
      <font>test7<p>test8</p></font>
    </body>
  </html>'
work <- htmlTreeParse(html, useInternalNodes = TRUE, encoding = 'UTF-8')
table <- xpathApply(work, "//p|//font", xmlValue)
table

The nesting is what causes the problem: because <font> and <p> tags are sometimes nested and sometimes not, I can't ignore either tag, but searching for both gives me partial duplicates. For other reasons, I prefer the text pieces to be broken up rather than aggregated (that is, taken from the lowest-level, most deeply nested tag).
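
To make the duplication concrete, here is a minimal sketch reduced to a single nested case. The names `snippet`, `doc`, and `dupes` are only for illustration, and `htmlParse(..., asText = TRUE)` is used here as a shorthand for parsing a string into internal nodes:

```r
library(XML)

# One <p> that holds its own text plus a nested <font>
snippet <- '<html><body><p>test5<font>test6</font></p></body></html>'
doc <- htmlParse(snippet, asText = TRUE)

# xmlValue() concatenates the text of all descendants, so the outer <p>
# reports the nested <font> text in addition to its own.
dupes <- xpathSApply(doc, "//p|//font", xmlValue)
dupes
# [1] "test5test6" "test6"
```

The second element repeats part of the first, which is the partial duplicate I want to avoid.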

The reason I am not just doing two separate searches and then appending them after removing duplicate strings is that I need to preserve the ordering of text as it appears in the html.

Thanks for reading!

Upvotes: 0

Views: 368

Answers (2)

Rich Scriven

Reputation: 99331

It looks like this might work:

xpathSApply(work, "//body//node()[//p|//font]//text()", xmlValue)
# [1] "test1" "test2" "test3" "test4" "test5" "test6" "test7" "test8"

Just switch to xpathApply if you want a list result. We could also use getNodeSet:

getNodeSet(work, "//body//node()[//p|//font]//text()", fun = xmlValue)
# [[1]]
# [1] "test1"
# 
# [[2]]
# [1] "test2"
# 
# [[3]]
# [1] "test3"
# 
# [[4]]
# [1] "test4"
# 
# [[5]]
# [1] "test5"
# 
# [[6]]
# [1] "test6"
# 
# [[7]]
# [1] "test7"
# 
# [[8]]
# [1] "test8"
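
One caveat, offered as a sketch rather than a definitive fix: the predicate `[//p|//font]` is an absolute path, so it holds for every node as long as the document contains any <p> or <font> at all, and the expression would also pick up text inside other tags under <body>. A narrower variant, assuming only text that sits directly inside a <p> or <font> is wanted, selects the text-node children of those two tags. The names `doc` and `texts` below are illustrative:

```r
library(XML)

# Same markup as in the question
html <- '<!DOCTYPE html>
<html lang="en">
  <body>
    <p>test1</p>
    <font>test2</font>
    <p><font>test3</font></p>
    <font><p>test4</p></font>
    <p>test5<font>test6</font></p>
    <font>test7<p>test8</p></font>
  </body>
</html>'
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# Select only text nodes whose parent is a <p> or a <font>.
# Each text node has exactly one parent, so nothing is reported twice,
# and the union comes back in document order.
texts <- xpathSApply(doc, "//p/text() | //font/text()", xmlValue)
texts
```

Whitespace-only text nodes between tags are children of <body> here, so they are not selected; pages with whitespace inside <p> or <font> may need an extra `[normalize-space()]` predicate.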

Upvotes: 1

SOConnell

Reputation: 793

Okay, I figured it out (entirely thanks to this post: http://www.r-bloggers.com/htmltotext-extracting-text-from-html-via-xpath/)

The answer for me was to take every text node in the html and filter out the ones I don't need (text inside script, style, and noscript tags), like this:

table <- xpathApply(work, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)
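
A small self-contained sketch of the same idea, in case it helps. The sample markup and the names `page`, `doc`, `xp`, and `texts` are made up for illustration, and the extra `[normalize-space()]` predicate is my own addition (not from the linked post) to skip whitespace-only text nodes:

```r
library(XML)

# Hypothetical page with style and script blocks mixed in
page <- '<html><head><style>p { color: red; }</style></head><body>
  <p>test1<font>test2</font></p>
  <script>var x = 1;</script>
  <font>test3<p>test4</p></font>
</body></html>'
doc <- htmlParse(page, asText = TRUE)

# Every text node that is not inside script/style/noscript
# and that contains something other than whitespace
xp <- "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][normalize-space()]"

# trimws() removes any leading/trailing whitespace left on the kept nodes
texts <- trimws(xpathSApply(doc, xp, xmlValue))
texts
```

Because this works on text nodes rather than on tags, it does not matter how <p> and <font> are nested, and the order of the text in the html is preserved.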

Upvotes: 1
