Reputation: 153
Given a page like "What popular startup advice is plain wrong?", I'd like to be able to extract the first topic under the topic heading on the upper right hand side, in this case, "Common Misconceptions".
What's the best way for me to do this in Ruby? Is it with Nokogiri or a regex? Presumably I need to do some HTML parsing?
Upvotes: 2
Views: 395
Reputation: 160551
First, you almost never, ever, want to use regular expressions to parse/extract/fold/spindle/mutilate XML or HTML. There are too many ways it can go wrong. Regular expressions are great for some jobs, but XML/HTML extractions are not a good fit.
That said, here's what I'd do using Nokogiri:
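require 'nokogiri'
require 'open-uri'
# On Ruby 3.0+ use URI.open; open-uri no longer lets Kernel#open fetch URLs.
doc = Nokogiri::HTML(URI.open('http://www.quora.com/What-popular-startup-advice-is-plain-wrong'))
# at returns the first node matching the CSS selector; content is its text.
topic = doc.at('span a.topic_name span').content
puts topic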
Running that outputs:
Common Misconceptions
The code takes a couple of shortcuts that should work consistently:
OpenURI allows easy access to Internet resources. It's my go-to for most simple to average apps. There are more powerful tools, but none as convenient.
doc.at tells Nokogiri to traverse the document and find the first occurrence of the CSS accessor 'span a.topic_name span', which should consistently be the first topic entry on that page.
Note that Nokogiri supports some variants of searching for a node: at vs. search. at, %, and things like at_css find the first occurrence and return a Node, which is an individual tag, text, or comment. search, /, and their variants return a NodeSet, which is like an array of Nodes. You'll have to walk that list or extract the individual nodes you want using some sort of Array accessor. In the above code I could have said doc.search(...).first to get the node I wanted.
Nokogiri also supports using XPath accessors, but for most things I'll usually go with CSS. It's simpler, and easier to read, but your mileage might vary.
Upvotes: 1