Zanam

Reputation: 4807

Deleting files with multiprocessing in Python

I am using the following code to delete a large number of files in Python:

import os
from multiprocessing import Pool

def deleteFiles(loc):
    def Fn_deleteFiles(inp):
        [fn, loc] = [inp['fn'], inp['loc']]
        os.remove(os.path.join(loc, fn))

    p = Pool(5)
    for path, subdirs, files in os.walk(loc):
        if len(files) > 0:
            inpData = [{'fn':x, 'loc':loc} for x in files]
            p.map(Fn_deleteFiles, inpData)
    p.close()

if __name__ == '__main__':
    loc = r'C:\myDriveWithFilesToDelete'
    deleteFiles(loc)

I get the following error:

  File "C:\Program Files\Python 3.5\lib\multiprocessing\reduction.py", line 50, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'deleteFiles.<locals>.Fn_deleteFiles'

Upvotes: 1

Views: 1683

Answers (1)

The Matt

Reputation: 1724

The problem is that you are defining a function inside another function.

Fn_deleteFiles(inp) is defined inside deleteFiles(loc).

This means that Fn_deleteFiles(inp) is _only_ created when deleteFiles(loc) runs, so it never exists as a module-level name.

Internally, multiprocessing.pool.Pool() uses the pickle library to transfer function objects from this Python process to the new worker processes that it spawns.

However, pickle stores a function by reference to its module and name rather than by value, so it fails to serialize a function that it cannot look up by name at the top level of a module.

Here is a short demo that reproduces a similar error.

import pickle
def foo():
    def bar():
        return "Hello"
    return bar

bar = foo()

if __name__ == '__main__':
    s = pickle.dumps(bar)

Running it causes the same kind of error:

Traceback (most recent call last):
  File ".../stacktest.py", line 10, in <module>
    s = pickle.dumps(bar)
AttributeError: Can't pickle local object 'foo.<locals>.bar'
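For contrast, here is a minimal sketch showing that a module-level function pickles without trouble while a nested one does not. The names make_local and is_picklable are hypothetical helpers made up for this illustration, and the exact exception type raised for a local function may differ between Python versions, so the sketch catches both candidates.

```python
import os
import pickle


def make_local():
    """Return a function defined inside another function (a 'local object')."""
    def inner():
        return "Hello"
    return inner


def is_picklable(obj):
    """Hypothetical helper: report whether pickle.dumps(obj) succeeds."""
    try:
        pickle.dumps(obj)
    except (AttributeError, pickle.PicklingError):
        # Which of the two is raised for a local function depends on the
        # Python version.
        return False
    return True


if __name__ == '__main__':
    # os.path.join lives at module level, so pickle can record it by name.
    print(is_picklable(os.path.join))
    # The nested function has no importable name, so pickling fails.
    print(is_picklable(make_local()))
    # The payload holds the function's name, not its code.
    print(b'join' in pickle.dumps(os.path.join))
```

The last line is the point: the pickled bytes only say where to find the function, so the receiving process must be able to import it under the same name.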

To fix this error, you can use multiprocessing.pool.ThreadPool instead. Its workers are threads inside the same process, so the function never needs to be pickled.

import os
from multiprocessing.pool import ThreadPool as Pool

def deleteFiles(loc):
    def Fn_deleteFiles(inp):
        [fn, loc] = [inp['fn'], inp['loc']]
        os.remove(os.path.join(loc, fn))

    p = Pool(5)
    for path, subdirs, files in os.walk(loc):
        if len(files) > 0:
            # Use path (the directory currently being walked), not the
            # top-level loc, so files in subdirectories resolve correctly.
            inpData = [{'fn':x, 'loc':path} for x in files]
            p.map(Fn_deleteFiles, inpData)
    p.close()
    p.join()  # wait for the worker threads to exit

if __name__ == '__main__':
    loc = 'DriveWithFilesToDelete'
    deleteFiles(loc)
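To try the ThreadPool approach without risking real data, here is a hedged sketch that runs against a throwaway temporary directory. The function name delete_files_threaded, its workers parameter, and the file names are assumptions made for this example, not part of the original code.

```python
import os
import tempfile
from multiprocessing.pool import ThreadPool


def delete_files_threaded(loc, workers=5):
    """Delete every file under loc using threads; return how many were removed."""
    def remove_one(full_path):
        # A nested function is fine here: threads share memory, so nothing
        # has to be pickled.
        os.remove(full_path)
        return 1

    # Build full paths from the directory being walked, not from loc.
    targets = [os.path.join(path, name)
               for path, subdirs, files in os.walk(loc)
               for name in files]
    with ThreadPool(workers) as pool:
        return sum(pool.map(remove_one, targets))


if __name__ == '__main__':
    # Throwaway directory so the sketch never touches real files.
    with tempfile.TemporaryDirectory() as tmp:
        os.mkdir(os.path.join(tmp, 'sub'))
        for rel in ('a.txt', 'b.txt', os.path.join('sub', 'c.txt')):
            open(os.path.join(tmp, rel), 'w').close()
        print(delete_files_threaded(tmp))
```

Collecting the paths first means map is called once for the whole tree instead of once per directory, so the threads stay busy across directory boundaries.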

Alternatively, you can define Fn_deleteFiles(inp) outside of deleteFiles(loc), at module level, so that pickle can find it by name.

WARNING: For reasons I don't understand, this version hangs when run inside the IDLE interpreter.

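import os
from multiprocessing import Pool

# Defined at module level so pickle can locate it by name in the workers.
def Fn_deleteFiles(inp):
    print("Delete", inp)
    [fn, loc] = [inp['fn'], inp['loc']]
    os.remove(os.path.join(loc, fn))

def deleteFiles(loc):
    p = Pool(5)
    for path, subdirs, files in os.walk(loc):
        if len(files) > 0:
            # Use path (the directory currently being walked), not the
            # top-level loc, so files in subdirectories resolve correctly.
            inpData = [{'fn':x, 'loc':path} for x in files]
            p.map(Fn_deleteFiles, inpData)
    p.close()
    p.join()  # wait for the worker processes to exit

if __name__ == '__main__':
    loc = 'DriveWithFilesToDelete'
    deleteFiles(loc)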
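A possible variation, sketched under assumptions rather than taken from the original code: since os.remove is itself an importable function, it can be handed to the pool directly once the full paths are known, which avoids the wrapper function and the dict of inputs. The names collect_files and delete_files, the pool_factory parameter, and the chunksize value are all hypothetical choices for this example.

```python
import os
from multiprocessing import Pool


def collect_files(loc):
    """Return the full path of every file below loc."""
    return [os.path.join(path, name)
            for path, subdirs, files in os.walk(loc)
            for name in files]


def delete_files(loc, pool_factory=Pool, workers=5):
    """Delete every file below loc with a single map call; return the count.

    pool_factory is a hypothetical hook: pass multiprocessing.pool.ThreadPool
    to use threads instead of processes.
    """
    targets = collect_files(loc)
    if not targets:
        return 0
    with pool_factory(workers) as pool:
        # os.remove is importable by name, so the workers can unpickle it.
        # chunksize batches paths to cut per-task overhead (64 is a guess).
        pool.map(os.remove, targets, chunksize=64)
    return len(targets)


if __name__ == '__main__':
    loc = 'DriveWithFilesToDelete'
    print(delete_files(loc))
```

Whether processes beat threads here depends on the filesystem, since deleting files is mostly I/O-bound. Timing both on a sample directory is the safest way to choose.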

Upvotes: 1
