Reputation: 33755
I would like to crawl a popular site (say Quora) that doesn't have an API and get some specific information and dump it into a file - say either a csv, .txt, or .html formatted nicely :)
E.g. return only a list of all the 'Bios' of the Users of Quora that have, listed in their publicly available information, the occupation 'UX designer'.
How would I do that in Ruby ?
I have a moderate enough level of understanding of how Ruby & Rails work. I just completed a Rails app - mainly all written by myself. But I am no guru by any stretch of the imagination.
I understand RegExs, etc.
Upvotes: 18
Views: 11959
Reputation: 96817
Your best bet would be to use Mechanize.It can follow links, submit forms, anything you will need, web client-wise. By the way, don't use regexes to parse HTML. Use an HTML parser.
Upvotes: 21
Reputation: 20191
Nokogiri is great, but I find the output messy to work with. I wrote a ruby gem to easily create classes off HTML: https://github.com/jassa/hyper_api
The HyperAPI gem uses Nokogiri to parse HTML with CSS selectors.
E.g.
Post = HyperAPI.new_class do
string title: 'div#title'
string body: 'div#body'
string author: '#details .author'
integer comments_count: '#extra .comment' do
size
end
end
# => Post
post = Post.new(html_string)
# => #<Post title: 'Hi there!', body: 'This blog post will talk about...', author: 'Bob', comments_count: 74>
Upvotes: 1
Reputation: 1163
Mechanize is awesome. If you're looking to learn something new though, you could take a look at Scrubyt: https://github.com/scrubber/scrubyt. It looks like Mechanize + Hpricot. I've never used it, but it seems interesting.
Upvotes: 2
Reputation: 10740
If you want something more high level, try wombat, which is this gem I built on top of Mechanize and Nokogiri. It is able to parse pages and follow links using a really simple and high level DSL.
Upvotes: 7
Reputation: 6436
I know the answer has been accepted, but Hpricot is also very popular for parsing HTML.
All you have to do is take a look at the html source of the pages and try to find a XPath or CSS expression that matches the desired elements, then use something like:
doc.search("//p[@class='posted']")
Upvotes: 6