Khaled

Reputation: 158

Multiprocessing handling of files in python

I am referring to this answer in order to handle multiple files at once using multiprocessing, but it stalls and doesn't work.

This is my attempt:

import multiprocessing
import glob
import json

def handle_json(file):
    with open(file, 'r', encoding = 'utf-8') as inp, open(file.replace('.json','.txt'), 'a', encoding = 'utf-8', newline = '') as out:
        length = json.load(inp).get('len','') #Note: each json file is not large and well formed
        out.write(f'{file}\t{length}\n')

p = multiprocessing.Pool(4)
for f, file in enumerate(glob.glob("Folder\\*.json")):
    p.apply_async(handle_json, file)
    print(f)

p.close()
p.join() # Wait for all child processes to close.

Where exactly is the problem? I thought it might be because I have 3000 JSON files, so I copied just 50 of them into another folder and tried with those, but I got the same problem.

ADDED: Debugging with VS Code

Exception has occurred: RuntimeError       (note: full exception trace is shown but execution is paused at: <module>)

        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
  File "C:\Users\admin\Desktop\F_New\stacko.py", line 10, in <module>
    p = multiprocessing.Pool(4)
  File "<string>", line 1, in <module> (Current frame)

Another addition: here is a zip file containing the sample files together with the code: https://drive.google.com/file/d/1fulHddGI5Ji5DC1Xe6Lq0wUeMk7-_J5f/view?usp=share_link

[Task Manager screenshot]

Upvotes: 1

Views: 280

Answers (2)

Ahmed AEK

Reputation: 17496

On Windows you have to guard your multiprocessing code with an if __name__ == "__main__": block; see Compulsory usage of if __name__=="__main__" in windows while using multiprocessing [duplicate].

You also need to call get on the tasks that you launched with apply_async in order to wait for them to finish, so you should store them in a list and call get on each one.

After these fixes, your code would look as follows:

import multiprocessing
import glob
import json

def handle_json(file):
    with open(file, 'r', encoding = 'utf-8') as inp, open(file.replace('.json','.txt'), 'a', encoding = 'utf-8', newline = '') as out:
        length = json.load(inp).get('len','') #Note: each json file is not large and well formed
        out.write(f'{file}\t{length}\n')

if __name__ == "__main__":
    p = multiprocessing.Pool(4)
    tasks = []
    for f, file in enumerate(glob.glob("Folder\\*.json")):
        task = p.apply_async(handle_json, [file])
        tasks.append(task)
        print(f)

    for task in tasks:
        task.get()
    p.close()
    p.join() # Wait for all child processes to close.
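
As a side note, since every task here calls the same function with one argument, the same pattern can be written more compactly with Pool.map, which blocks until all files are processed and re-raises any worker exception in the parent. A minimal sketch under the same assumptions (same handle_json worker, same Folder layout):

import multiprocessing
import glob
import json

def handle_json(file):
    # Same worker as above: read the 'len' field and append it to a .txt file
    with open(file, 'r', encoding='utf-8') as inp, open(file.replace('.json', '.txt'), 'a', encoding='utf-8', newline='') as out:
        length = json.load(inp).get('len', '')
        out.write(f'{file}\t{length}\n')

if __name__ == "__main__":
    files = glob.glob("Folder\\*.json")
    # The with-block closes the pool automatically; map blocks until done
    with multiprocessing.Pool(4) as p:
        p.map(handle_json, files)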

Upvotes: 1

match

Reputation: 11060

The apply_async function in multiprocessing expects the arguments to the called function to be passed as an iterable (such as a list or tuple), so you need to do e.g.:

p.apply_async(handle_json, [file])
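
A one-element tuple works just as well (note the trailing comma, which is what makes it a tuple):

p.apply_async(handle_json, (file,))

As in the other answer, you still need the if __name__ == "__main__": guard on Windows, and you should keep the returned AsyncResult objects so you can call get() on them.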

Upvotes: 1
