tblznbits

Reputation: 6778

How to properly utilize the multiprocessing module in Python?

I have 110 PDFs that I'm trying to extract images from. Once the images are extracted, I'd like to remove any duplicates and delete images that are less than 4KB. My code to do that looks like this:

import os
import shutil
import sys
import md5
from glob import glob
from subprocess import call
from multiprocessing import Pool

import pandas as pd
from PIL import Image

def extract_images_from_file(pdf_file):
    # Use the PDF's base name as the image root for pdfimages
    file_name = os.path.splitext(os.path.basename(pdf_file))[0]
    call(["pdfimages", "-png", pdf_file, file_name])
    os.remove(pdf_file)

def dedup_images():
    os.mkdir("unique_images")
    md5_library = []
    images = glob("*.png")
    print "Deleting images smaller than 4KB and generating the MD5 hash values for all other images..."
    for image in images:
        if os.path.getsize(image) <= 4000:
            os.remove(image)
        else:
            # Hash the decoded pixel data so identical images that
            # happen to be compressed differently still match
            m = md5.new()
            image_data = list(Image.open(image).getdata())
            image_string = "".join(["".join([str(tpl[0]), str(tpl[1]), str(tpl[2])]) for tpl in image_data])
            m.update(image_string)
            md5_library.append([image, m.digest()])
    headers = ['image_file', 'md5']
    # Sort by hash so duplicates are adjacent, then keep one of each
    dat = pd.DataFrame(md5_library, columns=headers).sort(['md5'])
    dat.drop_duplicates(subset="md5", inplace=True)

    print "Extracting the unique images."
    unique_images = dat.image_file.tolist()
    for image in unique_images:
        old_file = image
        new_file = "unique_images\\" + image
        shutil.copy(old_file, new_file)

This process can take a while, so I've started to dabble in multiprocessing. Feel free to interpret that as me saying I have no idea what I'm doing. I thought extracting the images would be easily parallelisable, but not the deduping, since there's a lot of I/O going on with one file and I have no idea how to parallelise that. So here's my attempt at the parallel process:

if __name__ == '__main__':
    filepath = sys.argv[1]
    folder_name = os.getcwd() + "\\all_images\\"
    if not os.path.exists(folder_name):
        os.mkdir(folder_name)
    pdfs = glob("*.pdf")
    print "Copying all PDFs to the images folder..."
    for pdf in pdfs:
        shutil.copy(pdf, ".\\all_images\\")
    os.chdir("all_images")
    # Farm the extraction out to 8 worker processes; map() blocks
    # until every PDF has been processed
    pool = Pool(processes=8)
    print "Extracting images from PDFs..."
    pool.map(extract_images_from_file, pdfs)
    print "Extracting unique images into a new folder..."
    dedup_images()
    print "All images have been extracted and deduped."

Everything seemed to work fine while extracting the images, but then it all went haywire. So here are my questions:

1) Am I setting up the parallel process correctly?
2) Does it continue to try to use all 8 processors on dedup_images()?
3) Is there anything I'm missing and/or not doing correctly?

Thanks in advance!

EDIT: Here is what I mean by "haywire". The errors start out with a bunch of lines like this:

I/O Error: Couldn't open image If/iOl eE r'rSourb:p oICe/onOua l EdNrner'wot r Y:oo prCekon u Cliodmunan'gttey   of1pi0e
l2ne1  1i'4mS auogbiepl o2fefinrlaee e N@'egSwmu abYipolor ekcn oaCm o Nupentwt  y1Y -o18r16k11 8.C1po4nu gn3't4
y7 5160120821143  3p4t7I 9/49O-8 88E78r81r.3op rnp:gt ' C
3o-u3l6d0n.'ptn go'p
en image file 'Ia/ ON eEwr rYoorr:k  CCIoo/uuOln dtEnyr' rt1o 0ro2:p1 e1Cn4o  uiolmidalng2'eft r m '
ai gpceoo emfn iapl teN  e1'w-S 8uY6bo2pr.okpe nnCgao' u
Nnetwy  Y1o0r2k8 1C4o u3n4t7y9 918181881134  3p4t7 536-1306211.3p npgt'
4-879.png'
I/O Error: CoulId/nO' tE rorpoern:  iCmoaugled nf'itl eo p'eub piomeangae  fNielwe  Y'oSrukb pCooeunnat yN e1w0 2Y8o1r
4k  3C4o7u9n9t8y8 811032 1p1t4  3o-i3l622f pt 1-863.png'

And then gets more readable with multiple lines like this:

I/O Error: Couldn't open image file 'pt 1-864.png'
I/O Error: Couldn't open image file 'pt 1-865.png'
I/O Error: Couldn't open image file 'pt 1-866.png'
I/O Error: Couldn't open image file 'pt 1-867.png'

This repeats for a while, going back and forth between the garbled error text and the readable lines.

Finally, it gets to here:

Deleting images smaller than 4KB and generating the MD5 hash values for all other images...
Extracting unique images into a new folder...

which implies that the code picks back up and continues on with the process. What could be going wrong?

Upvotes: 6

Views: 792

Answers (2)

strubbly

Reputation: 3477

Your code is basically fine.

The garbled text is all of the processes writing their own copies of the I/O Error message to the console at the same time, so the lines come out interleaved. The message itself is generated by the pdfimages command, probably because two concurrent runs conflict with each other, perhaps over temporary files or because both use the same image file name.
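The interleaving effect is easy to reproduce with a toy example. Whether the output actually garbles depends on the platform's console buffering, so treat this as illustrative only:

from multiprocessing import Pool

def worker(i):
    # Each process writes to the shared console with no locking,
    # so output from different workers can interleave arbitrarily
    print "worker %d: a reasonably long line of output text" % i

if __name__ == '__main__':
    Pool(processes=8).map(worker, range(100))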

Try using a different image root for each separate pdf file.
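For instance, a minimal sketch, assuming that tacking the worker's pid onto the root is an acceptable naming scheme (the "-%d" suffix is just one way to guarantee unique roots):

import os
from subprocess import call

def extract_images_from_file(pdf_file):
    # Append the worker's pid to the image root so no two concurrent
    # pdfimages runs can ever write to the same root
    base = os.path.splitext(os.path.basename(pdf_file))[0]
    root = "%s-%d" % (base, os.getpid())
    call(["pdfimages", "-png", pdf_file, root])
    os.remove(pdf_file)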

Upvotes: 3

user2993124

Reputation: 51

  1. Yes. Pool.map takes a function of one argument and an iterable; each element of the iterable is passed to that function in a worker process.
  2. No, because everything you have written here runs in the original process except for the body of extract_images_from_file(). Also, note that you're using 8 processes, not processors. If you happen to have an 8-core Intel CPU with Hyper-Threading enabled, you'd be able to run 16 of them concurrently.
  3. It looks fine to me, except that if extract_images_from_file() throws an exception, it will nuke your entire Pool, which is probably not what you want. To prevent this, you can put a try around the body of the worker; see the sketch after this list.
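A minimal sketch of that try wrapper, assuming you just want a log line per failed PDF (the name safe_extract_images_from_file is hypothetical):

def safe_extract_images_from_file(pdf_file):
    # Catch anything the real worker raises so one bad PDF only
    # produces a log line instead of killing the whole map() call
    try:
        extract_images_from_file(pdf_file)
    except Exception as e:
        print "Extraction failed for %s: %s" % (pdf_file, e)

# in __main__: pool.map(safe_extract_images_from_file, pdfs)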

What's the nature of the "haywire" you're dealing with? Can we see the exception text?

Upvotes: 3
