dsp_099

Reputation: 6121

Ruby: How to incorporate multithreading into this web-scraping scenario?

I have a list of folders which contain lots of text files. Inside those files are links.

Using each of those links, I need to fetch a webpage, parse it, and, depending on what's there, save a JPG file into a folder named after the one that contains the text file that provided the link.

Now the catch is that there are a LOT of text files and even more links inside them. I was thinking it might not be such a bad idea to multithread the process of connecting to and parsing the webpages.

So I'll have something like this:

directories.each do |directory|

  ...

  all_files_in_directory.each do |file|

    ...

    all_urls_in_file.each do |url|

      # check if there are any threads that aren't busy
      # make a thread go out to the url and parse it

    end

  end

end

I'm a bit unsure how to do that, if it's even possible. I can't seem to find a way to have threads just sort of hang out until I tell them to execute some_method(). It's as if what a thread does is assigned to it upon creation and cannot be changed.

So basically I want the script to be able to connect and parse in batches of, say, 5 at a time instead of just 1.

Is this doable, and if so, how would you solve this problem?

Upvotes: 1

Views: 689

Answers (2)

pguardiario

Reputation: 54984

You should consider eventmachine and em-http-request for concurrent HTTP requests.
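A minimal sketch of what that can look like using MultiRequest, assuming em-http-request 1.x (the URL list is a placeholder for the links pulled from your text files):

require 'eventmachine'
require 'em-http-request'

urls = ['http://example.com/a.html', 'http://example.com/b.html'] # placeholders

EM.run do
  multi = EventMachine::MultiRequest.new

  # Issue all requests concurrently on the reactor.
  urls.each_with_index do |url, i|
    multi.add(i, EventMachine::HttpRequest.new(url).get)
  end

  # Fires once every request has either succeeded or failed.
  multi.callback do
    multi.responses[:callback].each do |_name, http|
      # parse http.response here and save JPGs as needed
    end
    multi.responses[:errback].each do |name, _http|
      warn "request #{name} failed"
    end
    EM.stop
  end
end

Because everything runs on a single reactor thread there is no locking to worry about; the trade-off is that your parsing code also has to run inside the callbacks.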

Upvotes: 1

Martin James

Reputation: 24847

Typically, such activities are performed by queueing 'task' objects to a pool of threads that are waiting on a producer-consumer 'pool queue'. Each thread loops around forever, pulling tasks off the queue and calling a virtual 'run' method of the task. Usually, if they wish, tasks can create more tasks and submit them to the pool queue.

Different 'task' class descendants can have a run() method that does different things, and so, even though each thread is indeed 'doing what was assigned to it upon creation', that something means hanging about on a queue and then, when tasks become available, calling different overridden methods on different tasks.
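In Ruby, a minimal sketch of that pattern with the standard library's thread-safe Queue might look like this (ThreadPool, FetchTask and fetch_and_parse are illustrative names, not an existing API):

require 'thread' # Queue lives here on older Rubies

class ThreadPool
  def initialize(size)
    @queue = Queue.new
    @workers = Array.new(size) do
      Thread.new do
        # Each worker loops forever, pulling tasks off the queue and
        # calling their run method; a nil task is the shutdown signal.
        while (task = @queue.pop)
          task.run
        end
      end
    end
  end

  def submit(task)
    @queue << task
  end

  def shutdown
    @workers.size.times { @queue << nil }
    @workers.each(&:join)
  end
end

# One kind of task; other task classes just need their own run method.
class FetchTask
  def initialize(url, save_dir)
    @url, @save_dir = url, save_dir
  end

  def run
    fetch_and_parse(@url, @save_dir) # hypothetical: your fetch/parse/save code
  end
end

pool = ThreadPool.new(5)
# inside the directory/file/url loops: pool.submit(FetchTask.new(url, directory))
# once everything has been submitted:  pool.shutdown

Queue#pop blocks until something is pushed, which is exactly the 'hanging out until told to execute' behaviour the question was after.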

Flow control, right. Make a 'batchURL' task class that can hold 'batch size' urls. At start, create, say, 100 of them and push them onto an 'objectQueue' (a producer-consumer queue class like the pool queue). In your readline loop, pop a batchURL, load it up with urls and submit it to the pool queue. When a pool thread is done with a batchURL, it pushes it back onto the objectQueue for re-use. This puts a cap on the outstanding batchURLs: if the readline loop tries to queue up too many batchURLs, it will find the objectQueue empty and so will block until some batchURLs are recycled by the pool.

If you use a reasonable batchSize and a reasonable number of batchURLs and threads, the batchURLs should happily circulate around the objectQueue/workThread/poolQueue loop, carrying the data from your readline loop to the work threads in an efficient and effective manner.
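A sketch of that flow control, building on the ThreadPool sketch above (fetch_and_parse and all_urls are again hypothetical placeholders; the 100 batches and batch size of 5 are the example figures from the text):

BATCH_SIZE = 5

# Holds up to BATCH_SIZE urls; when a worker has processed it, the batch
# pushes itself back onto the free queue so the readline loop can refill it.
class BatchURL
  attr_reader :urls

  def initialize(free_queue)
    @free_queue = free_queue
    @urls = []
  end

  def run
    @urls.each { |url| fetch_and_parse(url) } # hypothetical, as above
    @urls.clear
    @free_queue << self # recycle for re-use
  end
end

free_queue = Queue.new
100.times { free_queue << BatchURL.new(free_queue) }

all_urls.each_slice(BATCH_SIZE) do |slice|
  batch = free_queue.pop # blocks while all 100 batches are in flight
  batch.urls.concat(slice)
  pool.submit(batch)
end
pool.shutdown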

Upvotes: 2
