Reputation: 407
My question has some similarities to this question: Why do we need message brokers like RabbitMQ over a database like PostgreSQL?
In my current (semi-professional) project I have also reached the point where I need to decide whether to go for a database, a message broker (e.g. RabbitMQ), or a totally different solution.
Let's imagine two tools, Tool A and Tool B. Whenever Tool A has run and finished, there might be something to do for Tool B. Execution of Tool A takes quite some time (> 60 sec), and often there will be nothing to do for Tool B. Tool A provides some metadata for Tool B so that Tool B knows what to do.
Message-based solution: Establish a message queue which Tool B is consuming. In case Tool A was executed and Tool B should run, Tool A publishes a message (including the metadata) to the queue which Tool B receives so Tool B will run using the metadata from the message.
Database solution: Whenever Tool A is running it adds a database record with e.g. timestamp, the metadata and status "RUNNING". In case Tool A was executed and Tool B should run, it updates the DB record status to "NEXT_TOOL_B". Tool B is constantly querying the DB for records with "NEXT_TOOL_B" status. In case it finds something, Tool B will run using the metadata from the DB records.
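For illustration, the database variant described above could be sketched roughly as follows. This is a minimal sketch, assuming an SQLite table named `jobs`; the table layout, the function names, and the extra statuses "DONE" and "TOOL_B_DONE" are hypothetical and not part of the question.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical table layout: one row per run of Tool A.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "id INTEGER PRIMARY KEY, started_at TEXT, metadata TEXT, status TEXT)"
    )
    conn.commit()


def tool_a_start(conn, metadata):
    # Tool A records that it is running as soon as it starts.
    cur = conn.execute(
        "INSERT INTO jobs (started_at, metadata, status) "
        "VALUES (datetime('now'), ?, 'RUNNING')",
        (metadata,),
    )
    conn.commit()
    return cur.lastrowid


def tool_a_finish(conn, job_id, needs_tool_b):
    # 'DONE' is an assumed status for runs that leave nothing for Tool B.
    status = "NEXT_TOOL_B" if needs_tool_b else "DONE"
    conn.execute("UPDATE jobs SET status = ? WHERE id = ?", (status, job_id))
    conn.commit()


def tool_b_poll_once(conn):
    # One polling pass of Tool B; a real Tool B would call this in a loop
    # with a sleep in between, which is the polling cost mentioned above.
    rows = conn.execute(
        "SELECT id, metadata FROM jobs WHERE status = 'NEXT_TOOL_B' ORDER BY id"
    ).fetchall()
    for job_id, _metadata in rows:
        # Tool B's actual work would happen here, using the metadata.
        conn.execute(
            "UPDATE jobs SET status = 'TOOL_B_DONE' WHERE id = ?", (job_id,)
        )
    conn.commit()
    return rows


def current_status(conn, job_id):
    # What a control panel such as Tool C could query at any time.
    row = conn.execute(
        "SELECT status FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()
    return row[0] if row else None
```

The point of the sketch is that `current_status` works at any moment, including while Tool A is still running.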
While I'm aware of the disadvantages of the database solution e.g. the constant polling from Tool B, I miss one feature of it in the message-based solution:
Whenever a third tool, say Tool C (e.g. a control panel UI), wants to know the current status, it can query the DB at any time and will find a "RUNNING" status if Tool A is still at work. In the message solution, I don't really see a way to "monitor" the status until the finish message is on the queue.
So my question is: can you think of a way to achieve this using messages, or any other method that gets along without polling?
Upvotes: 15
Views: 11886
Reputation: 1776
Disclaimer: I'm the author of cluster-tasks-service (CTS), the proposed solution, or otherwise a pattern to consider when using another relevant tool.
The thing is that, from a general architecture perspective, the functionality you describe seemingly needs both types of solutions:
I'd say that a DB-based queue would be a solution here. A DB-based queue definitely suffers from lower throughput than non-DB-based approaches. But it gives you some benefits like:
In terms of CTS, which is a cluster-aware task distribution and management system over a DB (provided by the consuming application and running embedded), your problem would be solved by running Tool B as a task enqueued by Tool A at the end of its process with all the relevant data.
Meanwhile, Tool C could use the APIs of CTS to check the status of the task/s and visualize them as needed.
Upvotes: 1
Reputation: 916
You can make the producers and consumers of the queue update a table in a NoSQL database or an RDBMS. This will allow you to view the status of your requests at any given time. It will also let you take advantage of pushed messages without any need for polling.
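A minimal sketch of this combination, assuming Python's in-process `queue.Queue` as a stand-in for a real broker and an in-memory SQLite table for the status. All function names, the table name `job_status`, and the statuses other than "RUNNING" and "NEXT_TOOL_B" are hypothetical.

```python
import queue
import sqlite3


def open_status_db():
    # Status table that every producer and consumer writes to.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job_status (job_id TEXT PRIMARY KEY, status TEXT)"
    )
    conn.commit()
    return conn


def set_status(conn, job_id, status):
    conn.execute(
        "INSERT OR REPLACE INTO job_status (job_id, status) VALUES (?, ?)",
        (job_id, status),
    )
    conn.commit()


def get_status(conn, job_id):
    # What Tool C would call; it never touches the queue.
    row = conn.execute(
        "SELECT status FROM job_status WHERE job_id = ?", (job_id,)
    ).fetchone()
    return row[0] if row else None


def tool_a_started(conn, job_id):
    set_status(conn, job_id, "RUNNING")


def tool_a_finished(conn, broker, job_id, metadata, needs_tool_b):
    if needs_tool_b:
        set_status(conn, job_id, "NEXT_TOOL_B")
        # Push the work to Tool B instead of letting it poll the table.
        broker.put({"job_id": job_id, "metadata": metadata})
    else:
        set_status(conn, job_id, "DONE")


def run_tool_b_once(conn, broker):
    # Blocks until a message arrives, so Tool B does not poll the database.
    message = broker.get()
    set_status(conn, message["job_id"], "TOOL_B_RUNNING")
    # Tool B's actual work would happen here, using message["metadata"].
    set_status(conn, message["job_id"], "TOOL_B_DONE")
    broker.task_done()
    return message
```

In this arrangement the queue carries the work and the table carries the state, so the status query and the push delivery do not compete with each other.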
Upvotes: 0
Reputation: 16157
The scenario described in the question is that of a system composed of multiple different pieces that work together to achieve a function. In this case, you have three different processes {A,B,C}, together with a database and an optional message queue. All systems, as part of their purpose of being, accept one or more inputs, execute some process, and produce one or more outputs. In your case, one of your desired outputs is the state of the system and its processing, which is not an altogether unreasonable thing to want to have.
Queue or Database?
Now, down to your question. Why use a message queue instead of a database? Both are similar components of a system in that they provide some storage capacity. You might well ask the same question in a refrigerator manufacturing plant: when does it make more sense to use a shelf on the assembly line as opposed to a warehouse?
Databases are like warehouses - they are designed to hold a lot of different things and keep them all relatively straight. A good warehouse allows users to find things in the warehouse quickly, and avoids losing parts and materials. If it goes in, it can easily come back out, but not instantly.
Message queues, on the other hand, are like the shelves located near the operator stations in an assembly line. Parts accumulate there from the previous operation waiting to be consumed by the person running the station. The shelves are designed to hold a small number of the same thing - just like a message queue in a software system. They are close to the worker, so when the next part is ready to be worked, it can be retrieved very quickly (as opposed to a trip to the warehouse, which can take several minutes or more). In addition, the worker has immediate visibility to what's on the shelf - if the shelf is empty, the worker might take a break and wait for it to accumulate a part or two again.
Finally, if one part of the factory grossly over-produces (we don't like it when this happens, because it indicates waste), then the shelves are going to be overwhelmed, and the overage is going to need to be put into the warehouse. Believe it or not, this happens all the time in factories - sometimes stations go down for brief periods of time and the warehouse acts as a longer-term buffer.
When to use one or the other?
So - back to the question. You use a message queue when you expect that your production of messages will usually match the consumption of messages, and you need speed in retrieval. You don't expect things to stay around in the queue very long. Software queue systems, such as RabbitMQ, also perform some very specific functions - like ensuring that a job only gets handled by one processor, and that it can get picked up by a different processor if the first goes down.
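The "one job, one processor" behaviour can be illustrated with a small in-process sketch. This assumes Python's `queue.Queue` and threads as stand-ins for a broker and its consumers; it does not show redelivery after a worker failure, which a broker such as RabbitMQ handles through acknowledgements. The function name `run_workers` and the worker names are hypothetical.

```python
import queue
import threading


def run_workers(jobs, worker_count=3):
    q = queue.Queue()
    handled = []  # (worker_name, job) pairs, in completion order
    lock = threading.Lock()

    def worker(name):
        while True:
            job = q.get()  # blocks; each item is delivered to one worker only
            if job is None:  # sentinel: no more work for this worker
                q.task_done()
                return
            with lock:
                handled.append((name, job))
            q.task_done()

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for t in threads:
        t.start()
    for job in jobs:
        q.put(job)
    for _ in threads:
        q.put(None)  # one sentinel per worker
    q.join()
    for t in threads:
        t.join()
    return handled
```

Which worker picks up which job varies from run to run, but every job appears in `handled` exactly once.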
On the other hand, you would use a database for things which require the persistence of state across multiple processing steps. Your job status is a perfect example of something that should be stored in the database. To continue the factory analogy - think of that as a report that gets sent back to the production planner when each step is completed. The production planner is going to keep it in a database.
You would also want to use a database when there is a likelihood that the queue will get full, or when it's critical that data not get lost between one job step and another. For example, a manufacturing plant will often store its finished products in the warehouse pending shipment to the customer. Use a database for all long-term (more than a few seconds) storage needs in your application.
Bottom Line
Most scalable software systems will have a need for both queues and databases, and the key is knowing when to use each.
Hopefully this makes some degree of sense.
Upvotes: 28