Reputation: 179
I've been trying to get the hang of multithreading in Python. However, whenever I attempt to make it do something that might be considered useful, I run into issues.
In this case, I have 300 PDF files. For simplicity, we'll assume that each PDF only has a unique number on it (say 1 to 300). I'm trying to make Python open the file, grab the text from it, and then use that text to rename the file accordingly.
The non-multithreaded version I made works great, but it's a bit slow, so I thought I'd see if I could speed it up. However, this version finds the very first file, renames it correctly, and then throws an error saying:
FileNotFoundError: [Errno 2] No such file or directory: './pdfPages/1006941.pdf'
Which is basically telling me that it can't find a file by that name. The reason it can't is that the file has already been renamed. And in my head that tells me that I've probably messed something up with this loop and/or multithreading in general.
Any help would be appreciated.
Source:
import PyPDF2
import os
from os import listdir
from os.path import isfile, join
from PyPDF2 import PdfFileWriter, PdfFileReader
from multiprocessing.dummy import Pool as ThreadPool
# Global
i=0
def readPDF(allFiles):
    print(allFiles)
    global i
    while i < l:
        i+=1
        pdf_file = open(path+allFiles, 'rb')
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        number_of_pages = read_pdf.getNumPages()
        page = read_pdf.getPage(0)
        page_content = page.extractText()
        pdf_file.close()
        Text = str(page_content.encode('utf-8')).strip("b").strip("'")
        os.rename(path+allFiles,path+pre+"-"+Text+".PDF")
pre = "77"
path = "./pdfPages/"
included_extensions = ['pdf','PDF']
allFiles = [f for f in listdir(path) if any(f.endswith(ext) for ext in included_extensions)] # Get all files in current directory
l = len(allFiles)
pool = ThreadPool(4)
doThings = pool.map(readPDF, allFiles)
pool.close()
pool.join()
Upvotes: 0
Views: 870
Reputation: 6891
Yes, you have in fact messed up the loop, as you say: the loop should not be there at all. This is handled implicitly by pool.map(...), which ensures that each function call receives a unique file name from your list to work with. You should not do any other looping.
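As a minimal illustration of that behaviour (the function double and the numbers here are toy examples, not part of your code), pool.map calls the worker exactly once per element of the iterable and collects the return values in order:

import time
from multiprocessing.dummy import Pool as ThreadPool

def double(n):
    # pool.map calls this once for each element of the iterable;
    # no manual loop or shared counter is needed inside the worker
    return n * 2

pool = ThreadPool(4)
results = pool.map(double, [1, 2, 3])  # results == [2, 4, 6]
pool.close()
pool.join()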
I have updated your code below by removing the loop and making a few other changes (minor, but still improvements, I think):
# Removed a number of imports
import PyPDF2
import os
from multiprocessing.dummy import Pool as ThreadPool
# Removed the unneeded global variable
def readPDF(allFiles):
    # The while loop is not needed, as pool.map will distribute the
    # different files to the different threads anyway
    print(allFiles)
    pdf_file = open(path+allFiles, 'rb')
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.getPage(0)
    page_content = page.extractText()
    pdf_file.close()
    Text = str(page_content.encode('utf-8')).strip("b").strip("'")
    os.rename(path+allFiles, path+pre+"-"+Text+".PDF")
pre = "77"
path = "./pdfPages/"
included_extensions = ('pdf','PDF') # Tuple instead of list
# A tuple allows for a simpler "f.endswith"
allFiles = [f for f in os.listdir(path) if f.endswith(included_extensions)]
pool = ThreadPool(4)
doThings = pool.map(readPDF, allFiles)
# doThings will be a list of "None"s since the readPDF returns nothing
pool.close()
pool.join()
Thus, the global variable and the counter are not needed, since all of that is handled implicitly. But even with these changes, it is not at all certain that this will speed up your execution much. Most likely, the bulk of your program's time is spent waiting on disk I/O. In that case, it is possible that even with multiple threads, they will still have to wait for the main resource, i.e., the hard drive. But to know for certain, you have to test.
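If you want to measure it, a rough sketch along these lines would do (using the standard library's time.perf_counter; note that readPDF renames the files, so each timing run needs a fresh copy of the directory):

import time

start = time.perf_counter()
pool = ThreadPool(4)
pool.map(readPDF, allFiles)
pool.close()
pool.join()
print("threaded: %.2f s" % (time.perf_counter() - start))

# For comparison, run the plain sequential loop on a fresh copy of the files
start = time.perf_counter()
for f in allFiles:
    readPDF(f)
print("sequential: %.2f s" % (time.perf_counter() - start))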
Upvotes: 2