Ethe99

Reputation: 49

python glob or listdir to create then save files from one directory to another

I'm converting documents from PDF to text. The PDFs currently sit in one folder, and the converted .txt files are saved to another. I have many of these documents and would prefer to iterate over subfolders, saving each result to a subfolder with the same name inside the txt folder, but I'm having trouble adding that layer.

I understand I can use glob to iterate recursively and build file lists, but it's unclear to me how to save the resulting files to a new folder. This isn't strictly necessary, but it would be much more convenient and efficient.

Is there a good way to do this?

import os
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)

    output = io.StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums):
            interpreter.process_page(page)
    converter.close()
    text = output.getvalue()
    output.close()  # release the in-memory buffer
    return text



def convertMultiple(pdfDir, txtDir):
    if pdfDir == "": pdfDir = os.getcwd()  # if no pdfDir passed in
    for pdf in os.listdir(pdfDir):  # iterate through pdfs in pdf directory
        name, fileExtension = os.path.splitext(pdf)  # keeps names containing extra dots intact
        if fileExtension == ".pdf":
            pdfFilename = os.path.join(pdfDir, pdf)
            text = convert(pdfFilename)
            textFilename = os.path.join(txtDir, name + ".txt")
            with open(textFilename, "w") as textFile:  # file is closed after writing
                textFile.write(text)


pdfDir = r"C:/Users/Documents/pdf/"
txtDir = r"C:/Users/Documents/txt/"
convertMultiple(pdfDir, txtDir)   

Upvotes: 0

Views: 1476

Answers (1)

GordonAitchJay

Reputation: 4860

As you suggested, glob works nicely here. It can even filter only .pdf files.

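The code below only prints the output paths. Once you've checked that they look right, uncomment the last three lines of the function (os.makedirs, open, and write) to actually save the files.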

import os
import glob

# Relies on convert() as defined in the question.

def convert_multiple(pdf_dir, txt_dir):
    if pdf_dir == "": pdf_dir = os.getcwd() # If no pdf_dir passed in 
    for filepath in glob.iglob(f"{pdf_dir}/**/*.pdf", recursive=True):
        text = convert(filepath)
        root, _ = os.path.splitext(filepath) # Remove extension
        txt_filepath = os.path.join(txt_dir, os.path.relpath(root, pdf_dir)) + ".txt"
        txt_filepath = os.path.normpath(txt_filepath) # Not really necessary
        print(txt_filepath)
#        os.makedirs(os.path.dirname(txt_filepath), exist_ok=True)
#        with open(txt_filepath, "wt") as f:
#            f.write(text)


pdf_dir = r"C:/Users/Documents/pdf/"
txt_dir = r"C:/Users/Documents/txt/"
convert_multiple(pdf_dir, txt_dir)   

To determine the filepath for the new .txt file, use functions from the os.path module.

os.path.relpath(filepath, pdf_dir) returns the path of the file, including any subdirectories, relative to pdf_dir. (The code above passes root, which is the same path with the extension already stripped.)

Suppose filepath is:

C:/Users/Documents/pdf/Setec Astronomy/employees.pdf

and pdf_dir is

C:/Users/Documents/pdf/

It would return Setec Astronomy/employees.pdf (with a backslash as the separator on Windows), which can then be passed into os.path.join() along with txt_dir, giving us the complete filepath with the extra subdirectories included.
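As a rough sketch of that mapping in isolation, the helper below combines splitext, relpath, and join. The function name txt_path_for and the short relative directories are only illustrative and do not come from the code above.

```python
import os

def txt_path_for(pdf_path, pdf_dir, txt_dir):
    """Map a PDF somewhere under pdf_dir to the matching .txt path under txt_dir.

    Hypothetical helper for illustration; it only builds the path and
    does not touch the filesystem.
    """
    root, _ = os.path.splitext(pdf_path)       # drop the ".pdf" extension
    relative = os.path.relpath(root, pdf_dir)  # subdirectories + file name below pdf_dir
    return os.path.normpath(os.path.join(txt_dir, relative) + ".txt")

# Illustrative directories, built with os.path.join so the separators
# match whatever platform this runs on.
pdf_dir = os.path.join("Documents", "pdf")
txt_dir = os.path.join("Documents", "txt")
pdf_path = os.path.join(pdf_dir, "Setec Astronomy", "employees.pdf")

print(txt_path_for(pdf_path, pdf_dir, txt_dir))
```

Because relpath and join work on path components rather than raw characters, a trailing slash on pdf_dir or txt_dir should not change the result.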

You could do txt_filepath = filepath.replace(pdf_dir, txt_dir), but you'd have to make sure all the corresponding slashes are in the same direction, and that there are no extra or missing leading/trailing slashes.
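To see why plain string replacement is fragile, here is a small sketch with two made-up spellings of the same file path. The variable names are illustrative, and the backslash form is just one way such a path might show up on Windows.

```python
pdf_dir = "C:/Users/Documents/pdf/"
txt_dir = "C:/Users/Documents/txt/"

# Same file, spelled once with forward slashes and once with backslashes.
matching = "C:/Users/Documents/pdf/Setec Astronomy/employees.pdf"
mismatched = "C:\\Users\\Documents\\pdf\\Setec Astronomy\\employees.pdf"

# The prefix is found and swapped only when the slashes match exactly.
replaced_ok = matching.replace(pdf_dir, txt_dir)

# str.replace finds no match here and returns the string unchanged,
# without raising any error.
replaced_bad = mismatched.replace(pdf_dir, txt_dir)

print(replaced_ok)
print(replaced_bad == mismatched)
```

In the second case the path still points into the pdf folder, and nothing signals that the replacement failed, which is the main reason to prefer the os.path functions.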

Before opening the new .txt file, any missing subdirectories need to be created. os.path.dirname() gets the path of the file's directory, and os.makedirs() creates it along with any missing parents. Its exist_ok argument is set to True to suppress the FileExistsError exception if the directory already exists.

A with statement is used when opening the .txt file so that it is closed automatically, even if an exception is raised while writing.
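The last two steps can be tried on their own in a throwaway directory. This is a minimal sketch: the temporary directory stands in for txt_dir, and the file contents and the explicit utf-8 encoding are assumptions made for the example.

```python
import os
import tempfile

# A temporary directory stands in for txt_dir so nothing real is touched.
with tempfile.TemporaryDirectory() as txt_dir:
    txt_filepath = os.path.join(txt_dir, "Setec Astronomy", "employees.txt")

    # Create the missing subdirectory; calling it twice shows that
    # exist_ok=True makes the second call a no-op instead of an error.
    os.makedirs(os.path.dirname(txt_filepath), exist_ok=True)
    os.makedirs(os.path.dirname(txt_filepath), exist_ok=True)

    # The with statement closes the file when the block ends.
    with open(txt_filepath, "w", encoding="utf-8") as f:
        f.write("extracted text")
    file_closed = f.closed

    # Read the file back to confirm what was written.
    with open(txt_filepath, encoding="utf-8") as f:
        written = f.read()

print(file_closed, written)
```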

Upvotes: 1
