marcamillion

Reputation: 33775

Parse all links on a page, visit them, extract the body copy then continue traversing efficiently

So this is what I have:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

root_url = "http://boxerbiography.blogspot.com/2006/11/table-of-contents.html"
file_path = "boxer-noko.html"

site = Nokogiri::HTML(open(root_url))

titles = []
content = []

site.css(".entry a").each do |link|
    titles.push(link)

    content_url = link[:href]
    content_page = Nokogiri::HTML(open(content_url))

    content_page.css("#top p").each do |copy|
        content.push(copy)
    end

end

But what this does is nest the loops: if there are 5 links on the main page, it goes to the first one, and content gets assigned the copy from all 5 links (with the current one at the top); then it goes back out, moves to the next link, and keeps doing the same.

So each link actually ends up with the copy for every single link, which looks like this:

Link 1

Copy associated with Link 1.
Copy associated with Link 2.
Copy associated with Link 3.
.
.
.

Link 2

Copy associated with Link 2.
Copy associated with Link 3.
Copy associated with Link 4.
Copy associated with Link 5.
Copy associated with Link 1.
.
.
.

etc.

What I would like it to do is return this:

Link 1

Copy associated with Link 1.

Link 2

Copy associated with Link 2.

In as efficient a way as possible.

How do I do that?

Edit 1: I guess an easy way to think about this is that in a single array, say titles, I would like to store both the link and the content associated with that link. But I'm not quite sure how to do that, given that I have to open two URI connections to parse both pages and keep going back to the root.

So I imagined it like:

titles[0] = { :href => "http://somelink.com", :content => "Copy associated with some link" }

But I can't quite get it there, so I'm forced to use two arrays, which seems suboptimal to me.
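
To make the target shape concrete, here is a rough sketch (the values are made-up placeholders) of the structure I'm aiming for and how it would be read back:

# One collection where each entry carries both the link and its copy
# (URLs and copy here are placeholders, not real data).
titles = [
    { :href => "http://somelink.com",    :content => "Copy associated with link 1." },
    { :href => "http://anotherlink.com", :content => "Copy associated with link 2." }
]

titles[0][:href]    # => "http://somelink.com"
titles[0][:content] # => "Copy associated with link 1."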

Upvotes: 2

Views: 555

Answers (1)

Dave Newton

Reputation: 160271

The following creates a hash keyed by URL; each URL's value is the collection of Nokogiri paragraph elements scraped from that page.

require 'rubygems'
require 'nokogiri'
require 'open-uri'

root_url = "http://boxerbiography.blogspot.com/2006/11/table-of-contents.html"

site = Nokogiri::HTML(open(root_url))

contents = {}
site.css(".entry a").each do |link|
    content_url = link[:href]
    p "Fetching #{content_url}..."
    content_page = Nokogiri::HTML(open(content_url))
    contents[link[:href]] = content_page.css("#top p")
end

As a sanity check, you can check the contents of one of the keys like this:

contents[contents.keys.first]

This may or may not be what you actually want, since it'll keep all the inner tags in place (<br/>s, <i>...</i>s, etc.), but that can be tweaked pretty easily by changing how the contents are gathered, or it can just be handled by post-processing each URL's contents.
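
For example, a minimal post-processing sketch (assuming the contents hash built above) that flattens each entry to plain text:

plain_contents = {}
contents.each do |url, paragraphs|
    # paragraphs is a Nokogiri NodeSet; calling text on each node drops
    # the inner tags and keeps only the copy.
    plain_contents[url] = paragraphs.map(&:text).join("\n\n")
end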

If you want to keep more information about each URL (like the link's text) then you'd probably want to create a tiny wrapper class with url and title attributes.
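
Something along these lines, for example (a rough sketch; the Struct and its attribute names are just one way to do it):

# A tiny value object holding the link's text, its URL, and the scraped paragraphs.
Entry = Struct.new(:title, :url, :content)

entries = site.css(".entry a").map do |link|
    page = Nokogiri::HTML(open(link[:href]))
    Entry.new(link.text, link[:href], page.css("#top p"))
end

entries.first.title   # the link's text
entries.first.content # the paragraphs scraped from that URL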

As it stands, the code doesn't do any checking to make sure each URL is only retrieved once--it might be better to create a Set of URLs to force uniqueness, then create the map by iterating over that set's contents (URLs).
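
A rough sketch of that idea (untested), collecting the unique URLs first and only then fetching each one:

require 'set'

# Build the set of hrefs first so duplicates are dropped before any fetching.
urls = Set.new(site.css(".entry a").map { |link| link[:href] })

contents = {}
urls.each do |url|
    p "Fetching #{url}..."
    contents[url] = Nokogiri::HTML(open(url)).css("#top p")
end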

Upvotes: 2
