allprog

Reputation: 16780

How to parse non-well-formed HTML with XmlSlurper

I'm trying to parse a non-well-formed HTML page (the Eclipse download site) with XmlSlurper. The W3C validator shows several errors in the page.

I tried the fault-tolerant parser from this post:

@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
import org.cyberneko.html.parsers.SAXParser 
import groovy.util.XmlSlurper

// Getting the xhtml page thanks to Neko SAX parser 
def mirrors = new XmlSlurper(new SAXParser()).parse("http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz")    

mirrors.'**'

Unfortunately, it looks like not all content is parsed into the XML object. The faulty subtrees are simply ignored.

E.g. mirrors.depthFirst().find { it.text() == 'North America' } returns null instead of the H4 element in the page.

Is there some robust way to parse any HTML content in groovy?

Upvotes: 3

Views: 12989

Answers (2)

bdkosher

Reputation: 5883

I am fond of the tagsoup SAX parser, which says it's designed to parse "poor, nasty and brutish" HTML.

It can be used in conjunction with XmlSlurper quite easily:

@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')
def parser = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())

def page = parser.parse('http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz')

println page.depthFirst().find { it.text() == 'North America'}
println page.depthFirst().find { it.text().contains('North America')}    

This results in non-null output.
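To see what TagSoup actually recovers, here's a minimal, self-contained sketch (the malformed markup and element contents are made up for illustration) that parses deliberately broken HTML in-memory rather than fetching the live page:

```groovy
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')

// Deliberately broken HTML: unclosed <li> tags and a bare ampersand,
// both of which would make a strict XML parser fail
def brokenHtml = '<ul><li>North America<li>Europe & Asia</ul>'

def parser = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
def page = parser.parseText(brokenHtml)

// TagSoup closes the stray tags, so both list items are reachable
def items = page.depthFirst().findAll { it.name() == 'li' }
assert items.size() == 2
assert items[0].text() == 'North America'
```

A strict parser would reject this input outright; TagSoup repairs it into a well-formed tree that XmlSlurper can traverse normally.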

Upvotes: 4

Opal

Reputation: 84766

With the following piece of code, the page gets parsed well (without errors):

@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14') 
import org.cyberneko.html.parsers.SAXParser 
import groovy.util.XmlSlurper

def parser = new SAXParser()
def page = new XmlSlurper(parser).parse('http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz')

However, I don't know exactly which elements you'd like to find.

Here, the 'All mirrors' link is found:

page.depthFirst().find { 
    it.text() == 'All mirrors'
}.@href

EDIT

Both of the following print null:

println page.depthFirst().find { it.text() == 'North America'}

println page.depthFirst().find { it.text().contains('North America')}

EDIT 2

Below you can find a working example that downloads the file and parses it correctly. I used wget to download the file (there's something wrong with downloading it with Groovy; I don't know what).

@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14') 
import org.cyberneko.html.parsers.SAXParser 
import groovy.util.XmlSlurper

def host = 'http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz'
def temp = File.createTempFile('eclipse', 'tmp')
temp.deleteOnExit()

def cmd = ['wget', host, '-O', temp.absolutePath].execute()
cmd.waitFor()
assert cmd.exitValue() == 0 // fail fast if the download didn't succeed

def parser = new SAXParser()
def page = new XmlSlurper(parser).parseText(temp.text)

println page.depthFirst().find { it.text() == 'North America'}
println page.depthFirst().find { it.text().contains('North America')}

EDIT 3

And finally, the problem is solved. Using Groovy's url.toURL().text causes problems when no User-Agent header is specified. With the header set, it works correctly and the elements are found - no external tools used.

@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14') 
import org.cyberneko.html.parsers.SAXParser 
import groovy.util.XmlSlurper

def host = 'http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/luna/SR1a/eclipse-jee-luna-SR1a-linux-gtk-x86_64.tar.gz'

def parser = new SAXParser()
def page = new XmlSlurper(parser).parseText(host.toURL().getText(requestProperties: ['User-Agent': 'Non empty']))

assert page.depthFirst().find { it.text() == 'North America'}
assert page.depthFirst().find { it.text().contains('North America')}

Upvotes: 8
