dabadaba

Reputation: 9512

Processing web pages concurrently with Ruby

I am trying to process the content of different pages given an array of URLs, using Ruby threads. However, when trying to open each URL I always get this error: #<SocketError: getaddrinfo: Name or service not known>

This is how I am trying to do it:

sites.each do |site|
    threads << Thread.new(site) do |url|
        puts url
        #web = open(url) { |i| i.read } # same issue opening the web this way
        web = Net::HTTP.new(url, 443).get('/', nil)
        lock.synchronize do
            new_md5[sites_hash[url]] = Digest::MD5.hexdigest(web)
        end
    end
end

sites is the array of URLs.

The same program, run sequentially, works alright:

sites.each { |site|
    web = open(site) { |i| i.read }
    new_md5 << Digest::MD5.hexdigest(web)
}

What's the problem?

Upvotes: 0

Views: 47

Answers (1)

the Tin Man

Reputation: 160551

Ugh. You're going to open a thread for every site you have to process? What if you have 10,000 sites?

Instead, set a limit on the number of threads, turn sites into a Queue, and have each thread remove a site, process it, and get another one. When there are no more sites in the Queue, the thread can exit.

The example in the Queue documentation will get you started.
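
A minimal sketch of that pattern, assuming sites is already populated, a pool of four workers, and a hypothetical fetch_page helper that returns a page body:

require 'digest/md5'

queue = Queue.new
sites.each { |site| queue << site }

lock    = Mutex.new
new_md5 = {}

workers = 4.times.map do
  Thread.new do
    # Non-blocking pop raises ThreadError when the queue is empty;
    # the rescue turns that into nil so the worker simply exits.
    while (url = (queue.pop(true) rescue nil))
      body = fetch_page(url)   # hypothetical helper that returns the page body
      lock.synchronize { new_md5[url] = Digest::MD5.hexdigest(body) }
    end
  end
end

workers.each(&:join)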

Instead of using get and always retrieving the entire body, use a backing database that keeps track of the last time the page was processed. Use head to check whether the page has been updated since then. If it has, then do a get. That will reduce your, and their, bandwidth and CPU usage. It's all about being a good network citizen and playing nice with other people's toys. If you don't play nice, they might not let you play with them any more.
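
A sketch of that check with Net::HTTP, assuming last_processed_at is a Time pulled from your backing database:

require 'net/http'
require 'uri'
require 'time'

uri  = URI.parse(url)
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = (uri.scheme == 'https')

head = http.head(uri.request_uri)
last_modified = head['Last-Modified'] && Time.httpdate(head['Last-Modified'])

# Only pull the whole body if the server says it changed, or won't say at all.
if last_modified.nil? || last_modified > last_processed_at
  body = http.get(uri.request_uri).body
  # ... hash it, process it, and record the new timestamp in the database ...
end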

I've written hundreds of spiders and site analyzers. I'd recommend always having a backing database and using it to keep track of the sites you're going to read, when you last read them, whether they were up or down the last time you tried to get a page, and how many times you've tried to reach them while they were down. (The last is so you don't bang your code's head on the wall trying to reach dead/down sites.)
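
As a rough sketch, that bookkeeping could be as simple as one table; here using the sqlite3 gem, with column names that are only suggestions:

require 'sqlite3'

db = SQLite3::Database.new('spider.db')
db.execute <<~SQL
  CREATE TABLE IF NOT EXISTS sites (
    url           TEXT PRIMARY KEY,
    last_read_at  DATETIME,          -- when the page was last processed
    was_up        BOOLEAN,           -- whether the site answered last time
    failure_count INTEGER DEFAULT 0  -- consecutive failed attempts
  )
SQL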

I had a 75-thread app that read pages. Each thread wrote its findings to the database, and, if a page needed to be processed, that HTML was written to a record in another table. A single app then read that table and did the processing. It was easy for the single app to stay ahead of 75 threads, because they were dealing with the slow internet.

The big advantage to using a backing database is that, if you write your code correctly, it can be shut down and pick up at the same spot, with the next site to be processed. You can easily scale it up to run on multiple hosts, too.


Regarding not being able to find the host:

Some things I see in your code:

- Net::HTTP.new(url, 443): new wants a host name or IP address, not a full URL. Passing the whole URL string is what triggers "getaddrinfo: Name or service not known", because there is no host by that name.
- You're forcing every request to port 443, which is HTTPS, but you never turn on SSL with use_ssl = true, so Net::HTTP tries to speak plain HTTP to an SSL port.

Either of those could explain why using open works but your code doesn't. (I'm assuming you're using OpenURI in conjunction with your single-threaded code though you don't show it, since open by itself doesn't know what to do with a URL.)
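
A sketch of the fix, pulling the pieces out of the URL before handing them to Net::HTTP:

require 'net/http'
require 'uri'

uri  = URI.parse(url)                     # e.g. "https://example.com/some/page"
http = Net::HTTP.new(uri.host, uri.port)  # Net::HTTP wants the host, not the URL
http.use_ssl = (uri.scheme == 'https')    # needed when talking to port 443
web  = http.get(uri.request_uri).body     # request the page's own path, not '/'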


In general, I'd recommend using Typhoeus and Hydra to process large numbers of sites in parallel. Typhoeus will also handle redirects for you, along with letting you use head requests. You can set how many requests are handled at the same time (concurrency), and it automatically handles duplicate requests (memoization) so redundant URLs don't get pounded.
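
A rough sketch of that approach, assuming the typhoeus gem is installed and sites is your array of URLs:

require 'typhoeus'
require 'digest/md5'

hydra = Typhoeus::Hydra.new(max_concurrency: 20)

sites.each do |url|
  request = Typhoeus::Request.new(url, followlocation: true)
  request.on_complete do |response|
    if response.success?
      puts "#{url} => #{Digest::MD5.hexdigest(response.body)}"
    else
      puts "#{url} failed with code #{response.code}"
    end
  end
  hydra.queue(request)
end

hydra.run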

Upvotes: 2
