Reputation: 3
I'm writing an application in Ruby that can search and fetch data from a site that has more than 10,000 pages. I use OpenURI and Nokogiri to open and parse the web pages, pull data out of them, and save it to a local data file:
# An example
require 'open-uri'
require 'nokogiri'

page = Nokogiri::HTML(open("http://example.com/books/title001.html"))
# Get title, author, synopsis, etc. from that page
On my ADSL connection it takes an average of one second to open a page. Since the site has about 10,000 pages, opening them all and fetching the data for every book would take more than three hours, which is unacceptable because my users won't want to wait that long.
How do I open and parse a large number of web pages fast and effectively with OpenURI and Nokogiri?
If I can't do that with them what should I do? And how can some applications that do the same work (list books, get all data from pages and save to a file) such as some manga downloaders just take 5-10 minutes to do that with large manga sites (about 10000 titles)?
Upvotes: 0
Views: 677
Reputation: 160581
Don't start with OpenURI; there is a much better way if you use Typhoeus and Hydra.
Like a modern code version of the mythical beast with 100 serpent heads, Typhoeus runs HTTP requests in parallel while cleanly encapsulating handling logic.
...
Parallel requests:
hydra = Typhoeus::Hydra.new
10.times.map{ hydra.queue(Typhoeus::Request.new("www.example.com", followlocation: true)) }
hydra.run
Farther down in the documentation...
How to get an array of responses back after executing a queue:
hydra = Typhoeus::Hydra.new
requests = 10.times.map {
  request = Typhoeus::Request.new("www.example.com", followlocation: true)
  hydra.queue(request)
  request
}
hydra.run
responses = requests.map { |request|
  request.response.response_body
}
request.response.response_body
is the line you want to wrap with Nokogiri's parser:
Nokogiri::HTML(request.response.response_body)
At that point you'll have an array of DOMs to walk through and process.
But wait! There's more!
Because you want to shave some processing time, you'll want to set up a Thread and Queue: push the parsed DOMs (or just the unparsed HTML response_body) onto the queue, then have the thread process them and write the files.
It's not hard, but it starts to put the question out of scope for Stack Overflow as it becomes a small book. Read the Thread and Queue documentation, especially the section about producers and consumers, and you should be able to piece it together. This is from the ri Queue docs:
= Queue < Object
(from ruby core)
------------------------------------------------------------------------------
This class provides a way to synchronize communication between threads.
Example:
require 'thread'

queue = Queue.new

producer = Thread.new do
  5.times do |i|
    sleep rand(i) # simulate expense
    queue << i
    puts "#{i} produced"
  end
end

consumer = Thread.new do
  5.times do |i|
    value = queue.pop
    sleep rand(i/2) # simulate expense
    puts "consumed #{value}"
  end
end
------------------------------------------------------------------------------
= Class methods:
new
= Instance methods:
<<, clear, deq, empty?, enq, length, num_waiting, pop, push, shift, size
I've used it to process large numbers of URLs in parallel and it was easy to set up and use. It's possible to do this using Threads for everything, and not use Typhoeus, but I think it's wiser to piggyback on the existing, well-written, tool than to try to roll your own.
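A minimal sketch of that producer/consumer setup applied to this task; the filenames and page data below are made up, standing in for real parsed results:

```ruby
require 'tmpdir'

queue = Queue.new

# Producer: pretend each item is a parsed page's data.
producer = Thread.new do
  3.times { |i| queue << ["book#{i}.txt", "data for book #{i}"] }
  queue << :done # sentinel so the consumer knows when to stop
end

# Consumer: write each item to a file as it arrives.
written = []
consumer = Thread.new do
  Dir.mktmpdir do |dir|
    while (item = queue.pop) != :done
      name, data = item
      File.write(File.join(dir, name), data)
      written << name
    end
  end
end

[producer, consumer].each(&:join)
```

In the real application the producer would be the code draining Hydra's responses, and the consumer would do the Nokogiri parsing and file writing.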
... how can some applications that do the same work (list books, get all data from pages and save to a file) such as some manga downloaders just take 5-10 minutes to do that with large manga sites (about 10000 titles)?
They have the bandwidth and the parallel connections to do it. It's not hard to process that many pages; you just have to be realistic about your resources and use what's available wisely. What's my advice? Lean on an existing, well-written tool like Typhoeus with Hydra, feed its output through a producer/consumer queue, and don't try to roll your own.
Upvotes: 2
Reputation: 48649
HTTP requests spend most of their time waiting, which makes this a good use case for multiple threads/processes. You can create a pool of worker threads/processes that pull request data from one Queue and shove the results into another Queue, which your main thread can read from.
See here: https://blog.engineyard.com/2014/ruby-thread-pool
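A stdlib-only sketch of that pool shape, with worker threads draining a work queue and filling a results queue; the URLs are hypothetical, and the fetch step is replaced by a dummy transformation so the example runs offline:

```ruby
work_q    = Queue.new
results_q = Queue.new

urls = (1..10).map { |i| "http://example.com/books/title%03d.html" % i }
urls.each { |u| work_q << u }

# A pool of four worker threads; each one drains the work queue.
pool = 4.times.map do
  Thread.new do
    until work_q.empty?
      begin
        url = work_q.pop(true) # non-blocking pop
      rescue ThreadError
        break # queue emptied between empty? and pop
      end
      # Real code would fetch and parse the page here; we just tag the URL.
      results_q << "fetched #{url}"
    end
  end
end

pool.each(&:join)
results = []
results << results_q.pop until results_q.empty?
```

The main thread ends up with all ten results regardless of which worker handled which URL.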
how can some applications that do the same work (list books, get all data from pages and save to a file) such as some manga downloaders just take 5-10 minutes to do that with large manga sites (about 10000 titles)?
Computing power. If you had a 10,000-core computer (or 10,000 computers with one core each), you could start one process for every request, and all the requests would execute at the same time. The total time to complete them would be just the time of the longest single request, rather than the sum of all the request times.
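To make that arithmetic concrete, here is a toy timing comparison where `sleep 0.2` stands in for a 0.2-second request: ten run back to back cost about two seconds, while ten run in parallel threads cost roughly the duration of one:

```ruby
# Ten simulated requests, run sequentially: total ≈ 10 × 0.2 s.
sequential_start = Time.now
10.times { sleep 0.2 }
sequential = Time.now - sequential_start

# The same ten, each in its own thread: total ≈ the longest single "request".
parallel_start = Time.now
10.times.map { Thread.new { sleep 0.2 } }.each(&:join)
parallel = Time.now - parallel_start
```

`sleep` releases Ruby's GVL just as network waits do, which is why threads help here even without multiple cores.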
Upvotes: 0