dsp_099

Reputation: 6121

Ruby: How to incorporate multithreading into this web-scraping scenario?

I have a list of folders which contain lots of text files. Inside those files are links.

Using each of those links, I need to fetch a webpage, parse it, and, depending on what's there, save a JPG file into a folder named after the one that contains the text file that provided the link.

Now the catch is that there are a LOT of text files and even more links inside them. I was thinking it might not be such a bad idea to multithread the process of connecting to and parsing the webpages.

So I'll have something like this:

directories.each do |directory|

  ...

  all_files_in_directory.each do |file|

    ...

    all_urls_in_file.each do |url|

      # check if there are any threads that aren't busy
      # make a thread go out to the url and parse it

    end

  end

end

I'm a bit unsure how to do that, if it's even possible. I can't seem to find a way to have threads just sort of hang out until I tell them to execute some_method(). It's as if what a thread does is assigned to it upon creation and cannot be changed.

So basically I want the script to be able to connect and parse in batches of, say, 5 at a time instead of just 1.

Is this doable, and if so, how would you solve this problem?

Upvotes: 1

Views: 689

Answers (2)

pguardiario

Reputation: 54984

You should consider eventmachine and em-http-request for concurrent HTTP requests.
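A minimal sketch of what that can look like using MultiRequest, assuming em-http-request 1.x (the URL list is a placeholder for the links pulled from your text files):

require 'eventmachine'
require 'em-http-request'

urls = ['http://example.com/a.html', 'http://example.com/b.html'] # placeholders

EM.run do
  multi = EventMachine::MultiRequest.new

  # Issue all requests concurrently on the reactor.
  urls.each_with_index do |url, i|
    multi.add(i, EventMachine::HttpRequest.new(url).get)
  end

  # Fires once every request has either succeeded or failed.
  multi.callback do
    multi.responses[:callback].each do |_name, http|
      # parse http.response here and save JPGs as needed
    end
    multi.responses[:errback].each do |name, _http|
      warn "request #{name} failed"
    end
    EM.stop
  end
end

Because everything runs on a single reactor thread there is no locking to worry about; the trade-off is that your parsing code also has to run inside the callbacks.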

Upvotes: 1

Martin James

Reputation: 24847

Typically, such activities are performed by queueing 'task' objects to a pool of threads that are waiting on a producer-consumer 'pool queue'. Each thread loops around forever, pulling tasks off the queue and calling a virtual 'run' method of the task. Usually, if they wish, tasks can create more tasks and submit them to the pool queue.

Different 'task' class descendants can have a run() method that does different things, and so, even though each thread is indeed 'doing what was assigned to it upon creation', that something means hanging about on a queue and then, when tasks become available, calling different overridden methods on different tasks.
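In Ruby, a minimal sketch of that pattern with the standard library's thread-safe Queue might look like this (ThreadPool, FetchTask and fetch_and_parse are illustrative names, not an existing API):

require 'thread' # Queue lives here on older Rubies

class ThreadPool
  def initialize(size)
    @queue = Queue.new
    @workers = Array.new(size) do
      Thread.new do
        # Each worker loops forever, pulling tasks off the queue and
        # calling their run method; a nil task is the shutdown signal.
        while (task = @queue.pop)
          task.run
        end
      end
    end
  end

  def submit(task)
    @queue << task
  end

  def shutdown
    @workers.size.times { @queue << nil }
    @workers.each(&:join)
  end
end

# One kind of task; other task classes just need their own run method.
class FetchTask
  def initialize(url, save_dir)
    @url, @save_dir = url, save_dir
  end

  def run
    fetch_and_parse(@url, @save_dir) # hypothetical: your fetch/parse/save code
  end
end

pool = ThreadPool.new(5)
# inside the directory/file/url loops: pool.submit(FetchTask.new(url, directory))
# once everything has been submitted:  pool.shutdown

Queue#pop blocks until something is pushed, which is exactly the 'hanging out until told to execute' behaviour the question was after.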

Flow control, right. Make a 'batchURL' task class that can hold 'batch size' urls. At start, create, say, 100 of them and push them onto an 'objectQueue' (a producer-consumer queue class like the pool queue). In your readline loop, pop a batchURL, load it up with urls and submit it to the pool queue. When a pool thread is done with a batchURL, it pushes it back onto the objectQueue for re-use. This puts a cap on the outstanding batchURLs: if the readline loop tries to queue up too many batchURLs, it will find the objectQueue empty and so will block until some batchURLs are recycled by the pool.

If you use a reasonable batchSize and a reasonable number of batchURLs and threads, the batchURLs should happily circulate around the objectQueue/workThread/poolQueue loop, carrying the data from your readline loop to the work threads in an efficient and effective manner.
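A sketch of that flow control, building on the ThreadPool sketch above (fetch_and_parse and all_urls are again hypothetical placeholders; the 100 batches and batch size of 5 are the example figures from the text):

BATCH_SIZE = 5

# Holds up to BATCH_SIZE urls; when a worker has processed it, the batch
# pushes itself back onto the free queue so the readline loop can refill it.
class BatchURL
  attr_reader :urls

  def initialize(free_queue)
    @free_queue = free_queue
    @urls = []
  end

  def run
    @urls.each { |url| fetch_and_parse(url) } # hypothetical, as above
    @urls.clear
    @free_queue << self # recycle for re-use
  end
end

free_queue = Queue.new
100.times { free_queue << BatchURL.new(free_queue) }

all_urls.each_slice(BATCH_SIZE) do |slice|
  batch = free_queue.pop # blocks while all 100 batches are in flight
  batch.urls.concat(slice)
  pool.submit(batch)
end
pool.shutdown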

Upvotes: 2
