Reputation: 6113
I'm writing a web crawler that should be able to parse multiple pages at the same time. I use Nokogiri for parsing, which is quite good and solves all my tasks, but I don't know how to achieve better performance.
I use threads to make many open-uri requests at the same time, and it makes the process quicker, but it seems that it's still far from the potential I could get out of a single server. Should I use multiple processes? What are the limits on the threads and processes that can be launched for a single Ruby application?
In other words: how do I achieve the best performance in this case?
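For reference, a simplified sketch of what I do now (not my exact code; the URLs are placeholders):

```ruby
require 'open-uri'
require 'nokogiri'

urls = ['http://example.com/page1', 'http://example.com/page2']

# One thread per request; Thread#value joins the thread and returns the parsed doc
docs = urls.map { |url|
  Thread.new { Nokogiri::HTML(URI.open(url)) }
}.map(&:value)

docs.each do |doc|
  # ... extract data with Nokogiri selectors, e.g. doc.css('a') ...
end
```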
Upvotes: 2
Views: 2506
Reputation: 29746
Hey, another way is to use a combination of Nokogiri and IronWorker (IronMQ and IronCache).
See a full blog entry on the topic here
Upvotes: 3
Reputation: 2727
If you want something easy, go for http://anemone.rubyforge.org/
If you want something fast, code something with eventmachine/em-http-request
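A minimal sketch of the em-http-request approach (the URLs and the parsing are placeholders, and error handling is kept to the bare minimum):

```ruby
require 'eventmachine'
require 'em-http-request'
require 'nokogiri'

urls = ['http://example.com/a', 'http://example.com/b']

EM.run do
  pending = urls.size
  urls.each do |url|
    http = EM::HttpRequest.new(url).get
    http.callback do
      doc = Nokogiri::HTML(http.response)
      # ... extract whatever you need from doc ...
      EM.stop if (pending -= 1).zero?
    end
    http.errback do
      EM.stop if (pending -= 1).zero?
    end
  end
end
```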
I found Redis to be a great multi-purpose tool for queue management, caching and so on. You could also use specialized things like beanstalkd/ActiveMQ/... but at least in my use case, I didn't really find them to be a big advantage compared to Redis. In particular, the load on the backend system can become a bottleneck, so choose your database carefully and pay attention to what you save.
Upvotes: 1
Reputation: 15371
We use a combination of ActiveMQ/Active Messaging, Event Machine, and multi-threading for this problem. We start off with a big list of URLs to fetch. We then break them down into batches of 100 URLs per batch. Each batch is then pushed into ActiveMQ. Then, we have an array of poller/consumer processes listening to the queue. These consumers can all be on one computer, or they can be spread across multiple computers. The array of consumers can grow arbitrarily large to support as much parallelism as we want. The consumers use Active Messaging, which is a nice Ruby integration with ActiveMQ.
When a consumer receives a message to process a batch of 100 URLs, it kicks off Event Machine to create a thread pool that can process multiple messages in multiple threads. Like you, we use Nokogiri to process each URL (see the sketch after the list below).
So, there are three levels of parallelism:
1) Multiple concurrent requests per consumer process, supported by Event Machine and threads.
2) Multiple consumer processes per computer.
3) Multiple computers.
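A rough sketch of the consumer side (the processor name and the :url_batches queue are made up for illustration, and plain Ruby threads stand in for the Event Machine pool described above):

```ruby
require 'json'
require 'open-uri'
require 'nokogiri'

# ActiveMessaging processor: subscribes_to binds it to an ActiveMQ destination,
# and on_message is invoked for each batch the producer pushes onto the queue.
class UrlBatchProcessor < ApplicationProcessor
  subscribes_to :url_batches

  def on_message(message)
    urls = JSON.parse(message)         # a batch of up to 100 URLs
    urls.each_slice(10) do |slice|     # cap the number of in-flight requests
      slice.map { |url|
        Thread.new { Nokogiri::HTML(URI.open(url)) }
      }.each { |t| process_doc(t.value) }  # t.value joins and returns the parsed doc
    end
  end

  def process_doc(doc)
    # ... extract and persist whatever the crawler needs ...
  end
end
```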
Upvotes: 1
Reputation: 3589
While it sounds like you're not looking for something quite so complex, I found this thesis an interesting read a while ago: Building blocks of a scalable webcrawler - Marc Seeger.
In terms of thread/process limits, Ruby has very low threading potential. Standard Ruby (MRI/YARV) and Rubinius don't support simultaneous thread execution unless you use an extension specifically built to support it, so threads mainly help while they're blocked on IO. Depending on how much of your performance trouble is in the IO and how much is in the processing, I'd suggest using EventMachine.
Multiple processes, however, Ruby handles very well: as long as you've got a good manager/database for all the processes to communicate through, running multiple processes should scale as well as your processing power allows.
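A minimal sketch of that idea using plain fork (a static URL list stands in for the shared manager/database):

```ruby
require 'open-uri'
require 'nokogiri'

urls = ['http://example.com/a', 'http://example.com/b',
        'http://example.com/c', 'http://example.com/d']
worker_count = 2

# Split the work and fork one worker process per slice; each child gets its
# own interpreter, so MRI's thread limitations no longer apply across workers.
pids = urls.each_slice((urls.size.to_f / worker_count).ceil).map do |slice|
  fork do
    slice.each do |url|
      doc = Nokogiri::HTML(URI.open(url))
      # ... parse doc and write results to your shared store ...
    end
  end
end

pids.each { |pid| Process.wait(pid) }
```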
Upvotes: 3
Reputation: 160551
I really like Typhoeus and Hydra for handling multiple requests at once.
Typhoeus is the HTTP client side, and Hydra is the part that handles multiple concurrent requests. The examples are good, so go through them and see.
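For a flavour of the API, a minimal sketch (the concurrency level and URLs are arbitrary):

```ruby
require 'typhoeus'
require 'nokogiri'

urls = ['http://example.com/a', 'http://example.com/b']

hydra = Typhoeus::Hydra.new(max_concurrency: 20)

urls.each do |url|
  request = Typhoeus::Request.new(url, followlocation: true)
  request.on_complete do |response|
    doc = Nokogiri::HTML(response.body)
    # ... pull out whatever you need from doc ...
  end
  hydra.queue(request)
end

hydra.run   # blocks until all queued requests have completed
```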
Upvotes: 4