Reputation: 163
I'm writing a web crawler. The crawler has two steps: fetching the HTML page, then parsing it.
I want to use a message queue to improve performance and throughput. I can think of two scenarios:
Scenario 1:
urlProducer -> queue1 -> urlConsumer -> queue2 -> parserConsumer
urlProducer: gets a target URL and adds it to queue1
urlConsumer: based on the job info, fetches the HTML page and adds it to queue2
parserConsumer: based on the job info, parses the page
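A minimal in-process sketch of scenario 1, using Python's standard `queue` and `threading` modules instead of a real message broker. The names `url_producer`, `url_consumer`, `parser_consumer`, `fetch_html`, and `parse_page` are illustrative placeholders, not code from the question:

```python
import queue
import threading

queue1 = queue.Queue()  # URLs waiting to be fetched
queue2 = queue.Queue()  # HTML pages waiting to be parsed

def fetch_html(url):
    # stand-in for a real HTTP request
    return "<html>%s</html>" % url

def parse_page(html):
    # stand-in for real parsing logic
    return html.upper()

def url_producer(urls):
    for url in urls:
        queue1.put(url)

def url_consumer():
    # fetch stage: queue1 -> queue2
    while True:
        url = queue1.get()
        if url is None:          # sentinel: no more work
            queue2.put(None)     # propagate shutdown down the chain
            break
        queue2.put(fetch_html(url))

def parser_consumer(results):
    # parse stage: queue2 -> results
    while True:
        html = queue2.get()
        if html is None:
            break
        results.append(parse_page(html))

results = []
fetcher = threading.Thread(target=url_consumer)
parser = threading.Thread(target=parser_consumer, args=(results,))
fetcher.start()
parser.start()
url_producer(["http://example.com/a", "http://example.com/b"])
queue1.put(None)                 # signal end of input
fetcher.join()
parser.join()
print(len(results))              # 2
```

Note how the shutdown sentinel has to be forwarded through every queue in the chain; that coupling is exactly what makes error handling harder in this structure.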
Scenario 2:
urlProducer -> queue1 -> urlConsumer
parserProducer -> queue2 -> parserConsumer
urlProducer: gets a target URL and adds it to queue1
urlConsumer: based on the job info, fetches the HTML page and writes it to the DB
parserProducer: gets the HTML page from the DB and adds it to queue2
parserConsumer: based on the job info, parses the page
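A minimal sketch of scenario 2, using an in-memory SQLite table as the intermediate store so the two stages are fully decoupled. The table name `pages` and the helpers `fetch_html`/`parse_page` are illustrative assumptions:

```python
import queue
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, html TEXT)")

queue1 = queue.Queue()
queue2 = queue.Queue()

def fetch_html(url):
    return "<html>%s</html>" % url   # stand-in for a real HTTP request

def parse_page(html):
    return html.upper()              # stand-in for real parsing

# Stage 1: urlProducer -> queue1 -> urlConsumer -> DB
for url in ["http://example.com/a", "http://example.com/b"]:
    queue1.put(url)
while not queue1.empty():
    url = queue1.get()
    db.execute("INSERT INTO pages VALUES (?, ?)", (url, fetch_html(url)))

# Stage 2, which can run later and independently:
# parserProducer (reads DB) -> queue2 -> parserConsumer
for (html,) in db.execute("SELECT html FROM pages"):
    queue2.put(html)
results = []
while not queue2.empty():
    results.append(parse_page(queue2.get()))

print(len(results))                  # 2
```

Because the DB sits between the stages, either stage can fail, be restarted, or be replayed without touching the other; the trade-off is the extra read/write and the need to track which pages are already parsed.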
Each structure can have multiple producers and consumers.
Scenario 1 is like a chain of calls: when an error occurs, it's difficult to find where the problem is.
Scenario 2 decouples queue1 and queue2: when an error occurs, it's easy to find where the problem is.
I'm not sure whether this notion is correct.
Which scenario is better? Or is there another scenario I should consider?
Thanks~
Upvotes: 0
Views: 241
Reputation: 119
The second scenario would be a better way to handle this problem if you want to use a simple messaging system, in my opinion. The three key tasks that you have implemented are fetching links, fetching pages from the links and parsing them to get required information. We need to keep in mind that the rates at which these operations are performed are different depending upon the size of the page being fetched. You would be better off having an intermediate storage to avoid clogging of the queueing systems.
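The clogging concern above can be made concrete with a bounded queue: if parsing is slower than fetching, an unbounded queue grows without limit, while a bounded one pushes back on the fast producer. This is a hypothetical sketch; `maxsize=100` and `enqueue_page` are arbitrary illustrative choices, and the "fall back to the database" branch is where scenario 2's intermediate storage would come in:

```python
import queue

# Bounded queue between the fetch and parse stages.
q = queue.Queue(maxsize=100)

def enqueue_page(html):
    try:
        q.put_nowait(html)   # fails fast instead of blocking the fetcher
        return True
    except queue.Full:
        # fall back to intermediate storage (e.g. a database) here
        return False

# Simulate a fetcher that outruns the parser: 150 pages, no consumer.
overflow = 0
for i in range(150):
    if not enqueue_page("<html>%d</html>" % i):
        overflow += 1

print(q.qsize(), overflow)   # 100 50
```

With an unbounded queue the same run would simply hold all 150 pages in memory, which is fine here but not when the backlog is millions of pages.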
That being said, I agree with @blockcipher's answer on this thread: use Storm clusters instead of a simple queueing mechanism. Storm worker nodes can also apply flow control for you, throttling each stream to an average rate. In that case, the first scenario would be more beneficial. So your choice depends on what you plan to use to implement your solution.
Upvotes: 1
Reputation: 2184
I think Scenario 1 is your best option, as you don't have to monitor a database, which could slow things down. I'm not sure what you plan on using to implement this, but I could see it done a couple of different ways:
There are other ways you can do this (web services, embedded queue's like ZeroMQ, other brokers, etc.), but since you mentioned throughput, those are two of the scenarios that would give you good throughput.
Upvotes: 1