Python3 multiprocessing

Question

I am an absolute beginner. I fumble my way through code by analogy to examples so apologies for any misuse of terminology.

I have written a small piece of code in python 3 which:

takes a user input (a folder on their computer)
searches the folder for pdf files
turns each page of the PDF to an image with sequential numbering
iterates through the jpgs in order of numbering, turning them black and white
OCR scans the files and outputs the text into an object
saves the text contents to a .txt file (via pytesseract)
deletes the jpgs, leaving the .txt file

Most time is taken in converting to jpgs and possibly making them black and white.

The code works, though I am sure it could be improved. It takes a while so I thought I'd try multiprocessing using Pools.

My code appears to create pools. I can also get the function to print a list of files in the folder, so it appears to have the list passed to it in one form or another.

I cannot get it to work and have now hacked the code about repeatedly with various errors. I think the main problem is, I am clueless.

My code begins:

User input block (asks for a folder in the user's directory, checks it is a valid folder etc).

OCR block as a function (parses PDF then outputs contents into single .txt file)

For loop block as a function (is supposed to loop over each PDF in the folder and execute the OCR block on it).

Multiprocessing block (is supposed to feed the list of files in the directory to the loop block).

To avoid writing War and Peace, I set out the latest version of the loop and multiprocessing blocks below:

#import necessary modules

home_path = os.path.expanduser('~')

#ask for input with various checking mechanisms to make sure a useful pdfDir is obtained
pdfDir = home_path + '/Documents/' + input('Please input the folder name where the PDFs are stored. The folder must be directly under the Documents folder. It cannot have a space in it. \n \n Name of folder:')

def textExtractor():
    #convert pdf to jpeg with a tesseract friendly resolution
    with Img(filename=pdf_filename, resolution=300) as img: #some can be encrypted so use OCR instead of other libraries
        #various lines of code here
    compilation_temp.close()

def per_file_process(subject_files):
    for pdf in subject_files:
        #decode the whole file name as a string
        pdf_filename = os.fsdecode(pdf)
        #check whether the string ends in .pdf
        if pdf_filename.endswith(".pdf"):
            #call the OCR function on it
            textExtractor()
        else:
            print('nonsense')

if __name__ == '__main__':
    pool = Pool(2)
    pool.map(per_file_process, os.listdir(pdfDir))

Is anyone willing/able to point out my errors, please?

The relevant bits of the code as it was when it worked (without multiprocessing):

#import necessary modules

home_path = os.path.expanduser('~')

#block accepting input
pdfDir = home_path + '/Documents/' + input('Please input the folder name where the PDFs are stored. The folder must be directly under the Documents folder. It cannot have a space in it. \n \n Name of folder:')

def textExtractor():
    #convert pdf to jpeg with a tesseract friendly resolution
    with Img(filename=pdf_filename, resolution=300) as img: #need to think about using generic expanduser or other libraries to allow portability
        #various lines of code to OCR and output .txt file
    compilation_temp.close()

subject_files = os.listdir(pdfDir)
for pdf in subject_files:
    #decode the whole file name as a string you can see
    pdf_filename = os.fsdecode(pdf)
    #check whether the string ends in .pdf
    if pdf_filename.endswith(".pdf"):
        textExtractor()
    else:
        pass #print for debugging

tdelaney · Accepted Answer

Pool.map calls the worker function once for each name returned by os.listdir. So in per_file_process, subject_files is a single filename, and for pdf in subject_files: iterates over the individual characters of that name. Further, listdir returns only base names, without the directory part, so you aren't looking in the right place for the pdf unless the folder happens to be the current working directory. You can use glob to filter by extension and get back a usable path to each file.
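If you want to see both problems for yourself without starting any processes, here is a minimal sketch. It uses the built-in map, which, like Pool.map, calls the function once per item. The temporary folder and the file names a.pdf and notes.txt are made up for illustration, and this per_file_process only mimics the loop from the question rather than doing any OCR.

```python
import os
import tempfile
from glob import glob


def per_file_process(subject_files):
    # Mimics the question's loop: iterate over whatever was passed in
    # and return the items that the loop would see.
    return [item for item in subject_files]


with tempfile.TemporaryDirectory() as pdf_dir:
    # Create two empty stand-in files (hypothetical names).
    for name in ("a.pdf", "notes.txt"):
        open(os.path.join(pdf_dir, name), "w").close()

    # os.listdir gives base names only, with no directory part.
    names = sorted(os.listdir(pdf_dir))

    # map passes ONE name per call, so the loop walks over its characters.
    per_call = list(map(per_file_process, names))

    # glob filters by extension and keeps the directory in each result.
    paths = sorted(glob(os.path.join(pdf_dir, "*.pdf")))
    paths_exist = all(os.path.isfile(p) for p in paths)

print(names)
print(per_call[0])
print(paths)
```

The second print shows single characters rather than file names, which is why the .endswith(".pdf") check in the question never sees a whole name.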

Your example is confusing... textExtractor() takes no parameters, so how is it to know which file it is processing? I'm going out on a limb and assuming that it really does take the path to the file it is processing. If so, you can parallelize rather easily just by feeding the pdf paths to it via map. Assuming processing time will vary by pdf, I am setting chunksize to 1 so that a worker that finishes early can grab extra files to process.

from glob import glob
import os
from multiprocessing import Pool

def textExtractor(pdf_filename):
    #convert pdf to jpeg with a tesseract friendly resolution
    with Img(filename=pdf_filename, resolution=300) as img: #some can be encrypted so use OCR instead of other libraries

        #...various lines of code here
    compilation_temp.close()

if __name__ == '__main__':
    #pdfDir is the folder inputted by user
    with Pool(2) as pool:
        # assuming call signature: textExtractor(path_to_file)
        pool.map(textExtractor,
            (filename for filename in glob(os.path.join(pdfDir, '*.pdf'))
            if os.path.isfile(filename)),
            chunksize=1)
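
Since the snippet above depends on parts of the asker's code that aren't shown, here is a self-contained sketch of the same pattern that can be run as is. Everything in it is an assumption made for illustration: extract_text_stub is a hypothetical stand-in for textExtractor that just reports the file size instead of doing OCR, find_pdfs and process_folder are helper names I made up, and the demo folder is a temporary directory rather than the user's input.

```python
import os
import tempfile
from glob import glob
from multiprocessing import Pool


def find_pdfs(pdf_dir):
    """Return full paths of the regular files in pdf_dir ending in .pdf."""
    pattern = os.path.join(pdf_dir, "*.pdf")
    return sorted(p for p in glob(pattern) if os.path.isfile(p))


def extract_text_stub(pdf_path):
    """Hypothetical stand-in for textExtractor: report the file's size."""
    return os.path.basename(pdf_path), os.path.getsize(pdf_path)


def process_folder(pdf_dir, workers=2):
    """Call the worker once per PDF path, spread over `workers` processes."""
    with Pool(workers) as pool:
        # chunksize=1 hands out one path at a time, so a worker that
        # finishes a short file early can pick up the next one.
        return pool.map(extract_text_stub, find_pdfs(pdf_dir), chunksize=1)


if __name__ == "__main__":
    # Demo on a throwaway folder with made-up files.
    with tempfile.TemporaryDirectory() as demo_dir:
        for name, body in (("a.pdf", b"12345"), ("b.pdf", b"1"), ("c.txt", b"")):
            with open(os.path.join(demo_dir, name), "wb") as fh:
                fh.write(body)
        print(process_folder(demo_dir))
```

The worker has to be a top-level function that takes its file as an argument, because each process receives only what map passes to it. pool.map returns the results in the same order as the input paths, even though the files may finish in a different order.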
