Reputation: 163
I'm writing a web crawler. The crawler has two steps: fetching the HTML page, then parsing it.
I want to use a message queue to improve performance and throughput. I can think of two scenarios:
Scenario 1:
urlProducer -> queue1 -> urlConsumer -> queue2 -> parserConsumer
urlProducer: gets a target URL and adds it to queue1
urlConsumer: based on the job info, fetches the HTML page and adds it to queue2
parserConsumer: based on the job info, parses the page
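A minimal in-process sketch of scenario 1, using Python's standard `queue` and `threading` modules instead of a real message broker. The names `url_producer`, `url_consumer`, `parser_consumer`, `fetch_html`, and `parse_page` are illustrative placeholders, not code from the question:

```python
import queue
import threading

queue1 = queue.Queue()  # URLs waiting to be fetched
queue2 = queue.Queue()  # HTML pages waiting to be parsed

def fetch_html(url):
    # stand-in for a real HTTP request
    return "<html>%s</html>" % url

def parse_page(html):
    # stand-in for real parsing logic
    return html.upper()

def url_producer(urls):
    for url in urls:
        queue1.put(url)

def url_consumer():
    # fetch stage: queue1 -> queue2
    while True:
        url = queue1.get()
        if url is None:          # sentinel: no more work
            queue2.put(None)     # propagate shutdown down the chain
            break
        queue2.put(fetch_html(url))

def parser_consumer(results):
    # parse stage: queue2 -> results
    while True:
        html = queue2.get()
        if html is None:
            break
        results.append(parse_page(html))

results = []
fetcher = threading.Thread(target=url_consumer)
parser = threading.Thread(target=parser_consumer, args=(results,))
fetcher.start()
parser.start()
url_producer(["http://example.com/a", "http://example.com/b"])
queue1.put(None)                 # signal end of input
fetcher.join()
parser.join()
print(len(results))              # 2
```

Note how the shutdown sentinel has to be forwarded through every queue in the chain; that coupling is exactly what makes error handling harder in this structure.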
Scenario 2:
urlProducer -> queue1 -> urlConsumer
parserProducer -> queue2 -> parserConsumer
urlProducer: gets a target URL and adds it to queue1
urlConsumer: based on the job info, fetches the HTML page and writes it to the DB
parserProducer: gets the HTML page from the DB and adds it to queue2
parserConsumer: based on the job info, parses the page
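A minimal sketch of scenario 2, using an in-memory SQLite table as the intermediate store so the two stages are fully decoupled. The table name `pages` and the helpers `fetch_html`/`parse_page` are illustrative assumptions:

```python
import queue
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, html TEXT)")

queue1 = queue.Queue()
queue2 = queue.Queue()

def fetch_html(url):
    return "<html>%s</html>" % url   # stand-in for a real HTTP request

def parse_page(html):
    return html.upper()              # stand-in for real parsing

# Stage 1: urlProducer -> queue1 -> urlConsumer -> DB
for url in ["http://example.com/a", "http://example.com/b"]:
    queue1.put(url)
while not queue1.empty():
    url = queue1.get()
    db.execute("INSERT INTO pages VALUES (?, ?)", (url, fetch_html(url)))

# Stage 2, which can run later and independently:
# parserProducer (reads DB) -> queue2 -> parserConsumer
for (html,) in db.execute("SELECT html FROM pages"):
    queue2.put(html)
results = []
while not queue2.empty():
    results.append(parse_page(queue2.get()))

print(len(results))                  # 2
```

Because the DB sits between the stages, either stage can fail, be restarted, or be replayed without touching the other; the trade-off is the extra read/write and the need to track which pages are already parsed.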
Each structure can have multiple producers and consumers.
Scenario 1 is like a chain of calls: when an error occurs, it's difficult to find where the problem is.
Scenario 2 decouples queue1 and queue2: when an error occurs, it's easy to find where the problem is.
I'm not sure whether this notion is correct.
Which scenario is better? Or is there another scenario I should consider?
Thanks~
Upvotes: 0
Views: 241
Reputation: 119
The second scenario would be a better way to handle this problem if you want to use a simple messaging system, in my opinion. The three key tasks that you have implemented are fetching links, fetching pages from the links and parsing them to get required information. We need to keep in mind that the rates at which these operations are performed are different depending upon the size of the page being fetched. You would be better off having an intermediate storage to avoid clogging of the queueing systems.
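The clogging concern above can be made concrete with a bounded queue: if parsing is slower than fetching, an unbounded queue grows without limit, while a bounded one pushes back on the fast producer. This is a hypothetical sketch; `maxsize=100` and `enqueue_page` are arbitrary illustrative choices, and the "fall back to the database" branch is where scenario 2's intermediate storage would come in:

```python
import queue

# Bounded queue between the fetch and parse stages.
q = queue.Queue(maxsize=100)

def enqueue_page(html):
    try:
        q.put_nowait(html)   # fails fast instead of blocking the fetcher
        return True
    except queue.Full:
        # fall back to intermediate storage (e.g. a database) here
        return False

# Simulate a fetcher that outruns the parser: 150 pages, no consumer.
overflow = 0
for i in range(150):
    if not enqueue_page("<html>%d</html>" % i):
        overflow += 1

print(q.qsize(), overflow)   # 100 50
```

With an unbounded queue the same run would simply hold all 150 pages in memory, which is fine here but not when the backlog is millions of pages.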
That being said, I agree with @blockcipher's answer on this thread: use Storm clusters instead of a simple queueing mechanism. Storm worker nodes can also apply flow control for you, throttling each stream to an average rate. In that case, the first scenario would be more beneficial. So your choice depends on what you plan to use to implement your solution.
Upvotes: 1
Reputation: 2184
I think Scenario 1 is your best option, as you don't have to monitor a database, which could slow things down. I'm not sure what you plan on using to implement this, but I could see it done a couple of different ways:
There are other ways you can do this (web services, embedded queue's like ZeroMQ, other brokers, etc.), but since you mentioned throughput, those are two of the scenarios that would give you good throughput.
Upvotes: 1