Reputation: 1573

Extracting XML from HTML?

The XML is embedded under the <pre> tag of the returned HTML page. I can extract the contents of the <pre> tag, but I am unable to convert this to XML correctly. I tried using the to_xml method of the NodeSet class, but it seems that the line endings (\n) are messing up the parsing.

Here is a snippet of my code:

require 'nokogiri'
require 'open-uri'

url = "http://www.ncbi.nlm.nih.gov/pubmed/?term=NS044283[GR]&dispmax=200&report=xml"
doc = Nokogiri::XML(URI.open(url))  # URI.open: Kernel#open no longer fetches URLs on Ruby 3+
pre = doc.xpath('//pre')
xml = pre.to_xml
contents = Nokogiri::XML(xml)
articles = contents.xpath('//PubmedArticle')
# => [] (no articles are found)

Upvotes: 3

Answers (3)

the Tin Man

Reputation: 160551

The document being retrieved isn't valid XML or HTML. Shame on those who created it.

Here are the first 200 or so characters, which show the confusion on their part: an XML declaration, then an XHTML DOCTYPE, then a bare <pre> holding HTML-escaped XML:

require 'open-uri'
url = "http://www.ncbi.nlm.nih.gov/pubmed/?term=NS044283[GR]&dispmax=200&report=xml"
# URI.open replaces the bare open(url), which no longer fetches URLs on Ruby 3+
puts URI.open(url).read[0..200]

which returns:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<pre>
&lt;PubmedArticle&gt;
    &lt;Medl

Luckily, or perhaps by design, Nokogiri works around that by being fairly lenient with malformed markup.

Upvotes: 1

Jesse Wolgamott

Reputation: 40277

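Since you're going to use Nokogiri to parse it anyway, just call text on the node set instead of to_xml. It returns the decoded content of the <pre> rather than re-escaped markup, and wrapping that in a single root element makes it parseable: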

require 'nokogiri'
require 'open-uri'
url = "http://www.ncbi.nlm.nih.gov/pubmed/?term=NS044283[GR]&dispmax=200&report=xml"
# URI.open replaces the bare open(url), which no longer fetches URLs on Ruby 3+
doc = Nokogiri::XML(URI.open(url))
pre = doc.xpath('//pre')
# text decodes &lt; and &gt;; the <root> wrapper gives the articles one parent
xml = "<root>" + pre.text + "</root>"
contents = Nokogiri::XML(xml)
articles = contents.css('PubmedArticle')
puts contents.css('ArticleTitle').map { |x| x.content }.count
# => 25

Upvotes: 4

dimuch

Reputation: 12818

The embedded XML is HTML-escaped, so it is not valid XML as it stands. Try unescaping it first:

require 'cgi'
...
xml = CGI.unescapeHTML(pre.to_xml) # or CGI.unescapeHTML(pre.to_s)
...
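
As a rough illustration of what the unescaping step does, here is a small sketch using only the standard library. The escaped string is a hypothetical stand-in for what pre.to_xml might return, with an invented title, not real output from the server.

```ruby
require 'cgi'

# Hypothetical stand-in for pre.to_xml: the <pre> tags are real markup, while
# the XML inside them is still HTML-escaped (note the doubly escaped ampersand).
escaped = "<pre>\n" \
          "&lt;PubmedArticle&gt;&lt;ArticleTitle&gt;Genes &amp;amp; pathways" \
          "&lt;/ArticleTitle&gt;&lt;/PubmedArticle&gt;\n" \
          "</pre>"

# A single unescaping pass restores the embedded tags and leaves the inner
# &amp; as a proper XML entity.
xml = CGI.unescapeHTML(escaped)

puts xml
```

In this sketch the surviving <pre> element ends up acting as the single root around the unescaped articles, so the result can be handed to an XML parser as is.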

Upvotes: -1
