Learning C

Reputation: 689

Good pipeline accessible by multiple processes?

I have a couple of scripts scraping data from multiple websites. The next step is processing the data. I want to set up a worker that receives the data and processes it. What is a good pipeline/workflow approach for having one worker always running and waiting for the scrapers to feed it data to process?

I thought of something like an API server to handle the requests, but is there a better solution?

Upvotes: 0

Views: 53

Answers (1)

jbch

Reputation: 186

Without more details I can only give generic recommendations:

If they are all running on the same machine, and the scrapers and worker are started by the same process, you could use multiprocessing.Queue from the standard library. It should work for a very simple workflow.
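As a rough illustration of the multiprocessing.Queue approach, here is a minimal sketch. The `scrape` and `work` functions, the fake records, and the `STOP` sentinel are all made up for the example; real scrapers and processing would replace them.

```python
import multiprocessing as mp

STOP = None  # sentinel that tells the worker to shut down (an assumption of this sketch)


def scrape(site, queue):
    """Hypothetical scraper: puts three fake records for one site on the queue."""
    for n in range(3):
        queue.put({"site": site, "n": n})


def work(queue, results):
    """Long-running worker: blocks on the queue until it sees STOP."""
    while True:
        record = queue.get()
        if record is STOP:
            break
        # Placeholder for the real processing step.
        results.put("{site}:{n}".format(**record))
    results.put(STOP)


def main():
    queue, results = mp.Queue(), mp.Queue()
    worker = mp.Process(target=work, args=(queue, results))
    scrapers = [mp.Process(target=scrape, args=(s, queue)) for s in ("site-a", "site-b")]
    worker.start()
    for p in scrapers:
        p.start()
    for p in scrapers:
        p.join()
    queue.put(STOP)  # every scraper has finished, so let the worker exit
    for line in iter(results.get, STOP):
        print(line)
    worker.join()


if __name__ == "__main__":
    main()
```

The parent process creates the queue and hands it to every child, which is why this only works when the scrapers and the worker are started from the same place.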

For greater flexibility, I would use a messaging library to communicate between processes. I like ZeroMQ but there are others.

ZeroMQ supports both local inter-process transport and network transport, and you can switch between transport types with very little code change. If you start with multiprocessing.Queue and it later turns out you want to run the workers on a different machine, you'll have to rewrite a lot of code.

The scrapers and worker could talk to each other directly (PUSH/PULL pattern), or you could have a broker/queue between them.

If you will only ever have one worker, PUSH/PULL could be sufficient. If you have more, you'll want a queue.

PUSH/PULL: each scraper talks to the worker and sends it work. The worker will have to poll each scraper for work.

Queue: the scrapers send tasks to the queue. The worker(s) query the queue for work.

PUSH/PULL is a bit simpler, but it means the worker has to be aware of and connect to each scraper. It can get messy if your workflow is complicated.
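A direct PUSH/PULL setup might look roughly like the sketch below, using pyzmq. The endpoint names and helper functions are invented for the example, and everything runs in one process over `inproc://` just to keep it short. In practice each scraper and the worker would be separate processes using `ipc://` or `tcp://` endpoints.

```python
import zmq


def start_scraper(ctx, endpoint):
    """Scraper side: bind a PUSH socket that the worker will connect to."""
    push = ctx.socket(zmq.PUSH)
    push.bind(endpoint)
    return push


def start_worker(ctx, endpoints):
    """Worker side: one PULL socket connected to every known scraper."""
    pull = ctx.socket(zmq.PULL)
    for endpoint in endpoints:
        pull.connect(endpoint)
    return pull


def push_pull_demo():
    ctx = zmq.Context()
    # Hypothetical endpoints; the worker needs the full list of scrapers.
    endpoints = ["inproc://scraper-a", "inproc://scraper-b"]
    scrapers = [start_scraper(ctx, e) for e in endpoints]
    worker = start_worker(ctx, endpoints)

    # Each scraper sends two fake records.
    for i, push in enumerate(scrapers):
        for n in range(2):
            push.send_json({"scraper": i, "n": n})

    # A single PULL socket reads from all of its connections in turn.
    received = [worker.recv_json() for _ in range(4)]

    for sock in scrapers + [worker]:
        sock.close()
    ctx.term()
    return received
```

Note that the worker has to be given every scraper's endpoint, which is the bookkeeping that gets messy as the number of scrapers grows.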

With a queue, the scrapers and worker only need to know about the queue; it acts as a central broker.
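One way to sketch the broker idea with pyzmq is a small forwarder built on `zmq.proxy`, with a PULL socket facing the scrapers and a PUSH socket facing the workers. This is a simplification rather than the ROUTER/DEALER queue device from the second link, and the endpoint names and `broker_demo` function are made up for the example. It again runs in one process over `inproc://` for brevity.

```python
import threading
import zmq


def run_broker(frontend, backend):
    """Forward everything from the scrapers' side to the workers' side."""
    try:
        zmq.proxy(frontend, backend)  # blocks until the context is terminated
    except zmq.ContextTerminated:
        pass
    finally:
        frontend.close(linger=0)
        backend.close(linger=0)


def broker_demo():
    ctx = zmq.Context()
    # Hypothetical endpoints; these two addresses are all anyone needs to know.
    tasks_in, tasks_out = "inproc://tasks-in", "inproc://tasks-out"

    frontend = ctx.socket(zmq.PULL)  # scrapers PUSH to this side
    frontend.bind(tasks_in)
    backend = ctx.socket(zmq.PUSH)   # workers PULL from this side
    backend.bind(tasks_out)
    broker = threading.Thread(target=run_broker, args=(frontend, backend))
    broker.start()

    # Two scrapers, each sending one fake task to the broker.
    scrapers = []
    for name in ("site-a", "site-b"):
        push = ctx.socket(zmq.PUSH)
        push.connect(tasks_in)
        push.send_string(name)
        scrapers.append(push)

    # The worker only connects to the broker, never to a scraper.
    worker = ctx.socket(zmq.PULL)
    worker.connect(tasks_out)
    received = sorted(worker.recv_string() for _ in range(2))

    for sock in scrapers + [worker]:
        sock.close(linger=0)
    ctx.term()  # unblocks zmq.proxy in the broker thread
    broker.join()
    return received
```

Adding a second worker here would just mean another PULL socket connecting to the same `tasks_out` endpoint; the scrapers would not need to change.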

http://learning-0mq-with-pyzmq.readthedocs.io/en/latest/pyzmq/patterns/pushpull.html

http://learning-0mq-with-pyzmq.readthedocs.io/en/latest/pyzmq/devices/queue.html

Upvotes: 1
