Ale A

Reputation: 359

Python3 parallelize jobs with multiprocessing

I have a script that parses a file containing the directories of other files; those files have to be opened and read, searching for a keyword. Since the number of files is growing, I'd like to use multiprocessing to reduce the time needed to complete the job.

I was thinking of leaving the parent process to parse the file containing the directories and using child processes to fetch the other files. Since the parent would need to obtain the data before creating the children, it would be a blocking architecture (the parent has to read the whole file before calling the children), whereas I'd like to send one of the children a list of directories every 100 results.

So, the parent continues parsing the file while the children work at the same time to find the keyword.

How could I do something like that? If you need more explanation, please ask and I'll tell you more.

Thanks.

Upvotes: 1

Views: 517

Answers (1)

S.Lott

Reputation: 391992

I was thinking of leaving the parent process to parse the file containing the directories and using child processes to fetch the other files.

A directory is a name. The parent parses a list and provides the directory name to each child. Right? The child then reads the files inside the directory.

Since the parent would need to obtain the data before creating the children, it would be a blocking architecture (the parent has to read the whole file before calling the children),

Um. The child doesn't read the files inside the directory? Up above, it says the child does read the files. It's silly for the parent to read a lot of data and push that to the children.

whereas I'd like to send one of the children a list of directories every 100 results.

Well. This is different. Now you want the parent to read a directory name, read a batch of 100 file names, and send the file names to a child. Okay. That's less silly than reading all the data. Now it's just 100 names.
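As a minimal sketch of just that batching step (the batches helper and its size default are names I'm inventing here, not anything from your question), the standard library can group any iterable of file names into tuples of up to 100:

    from itertools import islice

    def batches(names, size=100):
        """Yield tuples of up to `size` items from any iterable of names."""
        it = iter(names)
        while True:
            chunk = tuple(islice(it, size))
            if not chunk:
                return   # iterable exhausted
            yield chunk

Each tuple is small enough to pickle and push through a queue cheaply, which matters more than you'd think once processes are involved.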

So, the parent continues parsing the file while the children work at the same time to find the keyword.

Okay. But you're totally missing the opportunity for parallel processing.

Read the multiprocessing module documentation carefully.

What you want are two queues and two kinds of workers.

Your application will build the two queues. It will build a source Process, a pool of "get batch" worker processes, and a pool of "get files" worker processes.

  • Source. This process is (basically) a function that reads the original "file containing directories" and puts each directory name into the "get batch" queue.

  • Get Batch. This is a pool of processes. Each process is a function that gets an entry from the "get batch" queue. This is a directory name. It then reads the directory and enqueues a tuple of 100 file names into the "get files" queue.

  • Get Files. This is a pool of processes. Each process is a function that gets an entry from the "get files" queue. This is a tuple of 100 files. It then opens and reads these 100 files doing god-knows-what with them.

The idea of the multiprocessing module is to use pools of workers that all get their tasks from a queue and put their results into another queue. These workers all run at the same time.
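A minimal sketch of that layout, under a few assumptions of mine (the input file name directories.txt, the literal search keyword, the pool sizes, and printing matches rather than collecting them are all placeholders to adapt):

    import os
    from multiprocessing import Process, Queue

    BATCH_SIZE = 100
    N_BATCH_WORKERS = 2    # hypothetical pool sizes; tune for your hardware
    N_FILE_WORKERS = 4
    SENTINEL = None        # marks "no more work" on a queue

    def source(dir_list_path, batch_q):
        # Read the original "file containing directories", one name per line.
        with open(dir_list_path) as f:
            for line in f:
                name = line.strip()
                if name:
                    batch_q.put(name)
        for _ in range(N_BATCH_WORKERS):   # one sentinel per "get batch" worker
            batch_q.put(SENTINEL)

    def get_batch(batch_q, files_q):
        # Take a directory name, list its files, enqueue tuples of 100 names.
        while True:
            dirname = batch_q.get()
            if dirname is SENTINEL:
                break
            paths = [os.path.join(dirname, n) for n in os.listdir(dirname)]
            paths = [p for p in paths if os.path.isfile(p)]
            for i in range(0, len(paths), BATCH_SIZE):
                files_q.put(tuple(paths[i:i + BATCH_SIZE]))

    def get_files(files_q, keyword):
        # Take a tuple of file names, open and read each one, report hits.
        while True:
            batch = files_q.get()
            if batch is SENTINEL:
                break
            for path in batch:
                with open(path, errors="ignore") as f:
                    if keyword in f.read():
                        print(path)    # or put results on a third queue

    if __name__ == "__main__":
        batch_q, files_q = Queue(), Queue()
        src = Process(target=source, args=("directories.txt", batch_q))
        batchers = [Process(target=get_batch, args=(batch_q, files_q))
                    for _ in range(N_BATCH_WORKERS)]
        searchers = [Process(target=get_files, args=(files_q, "keyword"))
                     for _ in range(N_FILE_WORKERS)]
        for p in [src] + batchers + searchers:
            p.start()
        src.join()
        for p in batchers:
            p.join()                       # all directory names consumed
        for _ in range(N_FILE_WORKERS):    # now no more batches can appear
            files_q.put(SENTINEL)
        for p in searchers:
            p.join()

The sentinels are how each pool learns its upstream is finished: the source posts one per batch worker, and the parent posts one per file worker only after every batch worker has joined, so no batch can arrive after the shutdown signal.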

Upvotes: 3
