Reputation: 7316
I'm parsing HTML and trying to get the full, unparsed markup out of one particular node.
HTML example:
<html>
<body>
<div>Hello <br> World <br> !</div>
<div><object width="420" height="315"></object></div>
</body>
</html>
Code:
def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
def slurper = new XmlSlurper(tagsoupParser)
def htmlParsed = slurper.parseText(stringToParse)
println htmlParsed.body.div[0]
However, it returns only the text for the first node, and I get an empty string for the second node. Question: how can I retrieve the value of the first node so that I get:
Hello <br> World <br> !
Upvotes: 2
Views: 2501
Reputation: 25864
This is what I used to get the content of the first div tag (omitting the XML declaration and namespaces).
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import org.ccil.cowan.tagsoup.Parser
import groovy.xml.*
def html = """<html>
<body>
<div>Hello <br> World <br> !</div>
<div><object width="420" height="315"></object></div>
</body>
</html>"""
def parser = new Parser()
parser.setFeature('http://xml.org/sax/features/namespaces',false)
def root = new XmlSlurper(parser).parseText(html)
println new StreamingMarkupBuilder().bindNode(root.body.div[0]).toString()
Output:
<div>Hello <br clear='none'></br> World <br clear='none'></br> !</div>
N.B. Unless I'm mistaken, TagSoup is adding the closing tags. If you literally want Hello <br> World <br> !, you might have to use a different library (or maybe a regex?).
I know it's including the div element in the output... is this a problem?
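If the wrapping div and the closing tags that TagSoup adds do matter, one option is to post-process the serialized string. What follows is only a minimal sketch in plain Groovy. It assumes the serializer output has exactly the shape shown above (a single outer div, and br elements written as <br clear='none'></br>), and the variable names are made up for illustration. Regex cleanup like this is brittle and only handles this specific shape.

```groovy
// Assumption: this is the string produced by the StreamingMarkupBuilder call above.
def serialized = "<div>Hello <br clear='none'></br> World <br clear='none'></br> !</div>"

// Drop the outer <div ...> opening tag and the trailing </div>.
def inner = serialized
        .replaceFirst(/^<div[^>]*>/, '')
        .replaceFirst(/<\/div>$/, '')

// Collapse each <br ...></br> pair (with whatever attributes were added) back to a bare <br>.
def literal = inner.replaceAll(/<br[^>]*><\/br>/, '<br>')

println literal   // prints: Hello <br> World <br> !
```

This recovers the text from the question only because the input is so simple. For anything less regular, a parser that keeps the original markup would be a safer choice.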
Upvotes: 5