How do I extract a HTML topic heading from a web page?

Question

Given a page like "What popular startup advice is plain wrong?", I'd like to be able to extract the first topic under the topic heading on the upper right hand side, in this case, "Common Misconceptions".

What's the best way for me to do this in Ruby? Is it with Nokogiri or a regex? Presumably I need to do some HTML parsing?

the Tin Man · Accepted Answer

First, you almost never, ever, want to use regular expressions to parse/extract/fold/spindle/mutilate XML or HTML. There are too many ways it can go wrong. Regular expressions are great for some jobs, but XML/HTML extractions are not a good fit.

That said, here's what I'd do using Nokogiri:

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open('http://www.quora.com/What-popular-startup-advice-is-plain-wrong'))

topic = doc.at('span a.topic_name span').content
puts topic

Running that outputs:

Common Misconceptions

The code is taking a couple shortcuts, that should work consistently:

Using Ruby's OpenURI allows easy accessing of Internet resources. It's my go-to for most simple to average apps. There are more powerful tools but none as convenient.
doc.at tells Nokogiri to traverse the document, and find the first occurrence of the CSS accessor 'span a.topic_name span', which should be consistent in that page as the first entry.

Note that Nokogiri supports some variants of searching for a node: at vs. search. at and % and things like css_at find the first occurrence and return a Node, which is an individual tag or text or comment. search, /, and those variants return a NodeSet which is like an array of Nodes. You'll have to walk that list or extract the individual nodes you want using some sort of Array accessor. In the above code I could have said doc.search(...).first to get the node I wanted.

Nokogiri also supports using XPath accessors, but for most things I'll usually go with CSS. It's simpler, and easier to read, but your mileage might vary.

How do I extract a HTML topic heading from a web page?

Answers (1)

Related Questions