Reputation: 11
I'm designing a multi-threaded data pipeline that runs on Service A, following a producer-consumer design to process a list of objects. The producer calls the API of Service B to retrieve tasks for each object (there could be millions of tasks per object), and the consumer processes the tasks and posts the results to Service C.
My question is: how do I design this so that when either Service B goes down (the producer can't get tasks) or Service C goes down (the consumer can't post results), I can gracefully stop and save my progress?
I can't store all of my tasks in a BlockingQueue up front, because there would be too many tasks to hold in memory.
I am considering storing the tasks in a DB, but that would dramatically slow down the pipeline, since every task would now incur a write and a read. It also seems like a waste of DB space to store millions of tasks that I only ever need to process once; I would have to delete them all after processing finishes, so the effort feels almost wasted.
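For reference, here is a minimal sketch of the producer-consumer shape I have so far (the Task and service client types are placeholders for my actual code). The bounded queue keeps memory capped, and the stop flag is my current idea for a graceful stop, but it doesn't save progress yet:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Placeholders for my actual task type and service clients.
interface Task {}
interface ServiceBClient { Iterable<Task> fetchTasks(String objectId); }
interface ServiceCClient { void post(Task task); }

public class Pipeline {
    // Bounded queue: the producer blocks when it fills up, so memory stays capped.
    private final BlockingQueue<Task> queue = new LinkedBlockingQueue<>(10_000);
    private final AtomicBoolean stopRequested = new AtomicBoolean(false);

    void produce(ServiceBClient serviceB, String objectId) throws InterruptedException {
        for (Task task : serviceB.fetchTasks(objectId)) {
            if (stopRequested.get()) return;  // stops gracefully, but progress is lost
            queue.put(task);                  // blocks while the queue is full
        }
    }

    void consume(ServiceCClient serviceC) throws InterruptedException {
        while (!stopRequested.get()) {
            Task task = queue.poll(1, TimeUnit.SECONDS); // timed poll so we can re-check the flag
            if (task != null) {
                serviceC.post(task);
            }
        }
    }

    public void requestStop() {
        stopRequested.set(true);
    }
}
```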
Upvotes: 1
Views: 41
Reputation: 1
Firstly, if you can process the data in smaller chunks rather than in one bulk pass, do so. Message brokers and message queues like RabbitMQ and Apache Kafka can also be useful in this scenario, since they buffer tasks durably outside your process, letting the producer and consumer fail independently. Secondly, you can use cursors or checkpoints to persist the last processed offset somewhere that survives a restart, and then continue processing from the last checkpoint after the system comes back up.
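A rough sketch of the checkpoint idea might look like this (the store, source, and sink interfaces are stand-ins for whatever you actually use; only the checkpoint store needs to be durable):

```java
import java.util.List;

// Stand-ins for your actual components.
interface Task {}
interface CheckpointStore {
    long loadOffset(String objectId);              // 0 if no checkpoint exists yet
    void saveOffset(String objectId, long offset); // write to a DB, file, Redis, etc.
}
interface TaskSource { List<Task> fetchTasks(String objectId, long fromOffset, int limit); }
interface TaskSink { void post(Task task); }

public class CheckpointedWorker {
    private static final int CHUNK_SIZE = 500;

    public void run(String objectId, CheckpointStore store, TaskSource source, TaskSink sink) {
        long offset = store.loadOffset(objectId);  // resume from the last checkpoint
        while (true) {
            List<Task> chunk = source.fetchTasks(objectId, offset, CHUNK_SIZE);
            if (chunk.isEmpty()) {
                break;  // all tasks for this object are done
            }
            for (Task task : chunk) {
                sink.post(task);  // if Service C is down, this throws and we stop here
            }
            offset += chunk.size();
            store.saveOffset(objectId, offset);  // at most one chunk is redone on restart
        }
    }
}
```

If Service B or C fails, the loop simply stops, and the last saved offset is at most one chunk behind, so a restart re-processes at most CHUNK_SIZE tasks. That also means posting to Service C should be idempotent, since a few tasks may be sent twice.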
Upvotes: 0