Reputation: 49
I'm converting documents from PDF to text. The PDFs currently sit in one folder, and the converted .txt files are saved to another. I have many of these documents and would prefer to iterate over subfolders, saving each result to a subfolder of the same name in the txt folder, but I'm having trouble adding that layer.
I understand I can use glob to iterate recursively and build file lists, but it's unclear to me how to save the resulting files to the matching new folder. This isn't strictly necessary, but it would be much more convenient and efficient.
Is there a good way to do this?
import os
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)
    output = io.StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    infile = open(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()
    return text

def convertMultiple(pdfDir, txtDir):
    if pdfDir == "": pdfDir = os.getcwd() + "\\"  # if no pdfDir passed in
    for pdf in os.listdir(pdfDir):  # iterate through pdfs in pdf directory
        fileExtension = pdf.split(".")[-1]
        if fileExtension == "pdf":
            pdfFilename = pdfDir + pdf
            text = convert(pdfFilename)
            textFilename = txtDir + pdf.split(".")[0] + ".txt"
            textFile = open(textFilename, "w")
            textFile.write(text)
            textFile.close()

pdfDir = r"C:/Users/Documents/pdf/"
txtDir = r"C:/Users/Documents/txt/"
convertMultiple(pdfDir, txtDir)
Upvotes: 0
Views: 1476
Reputation: 4860
As you suggested, glob works nicely here. It can even filter for only .pdf files.
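Uncomment the three commented-out lines (the ones that create the directories and write the file) once you've tested it and the printed paths look right.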
import os, glob

def convert_multiple(pdf_dir, txt_dir):
    if pdf_dir == "": pdf_dir = os.getcwd()  # If no pdf_dir passed in
    for filepath in glob.iglob(f"{pdf_dir}/**/*.pdf", recursive=True):
        text = convert(filepath)
        root, _ = os.path.splitext(filepath)  # Remove extension
        txt_filepath = os.path.join(txt_dir, os.path.relpath(root, pdf_dir)) + ".txt"
        txt_filepath = os.path.normpath(txt_filepath)  # Not really necessary
        print(txt_filepath)
        # os.makedirs(os.path.dirname(txt_filepath), exist_ok=True)
        # with open(txt_filepath, "wt") as f:
        #     f.write(text)

pdf_dir = r"C:/Users/Documents/pdf/"
txt_dir = r"C:/Users/Documents/txt/"
convert_multiple(pdf_dir, txt_dir)
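To see what the recursive pattern matches without touching real documents, here is a minimal sketch that builds a throwaway directory tree and lists it the same way. The file and folder names are made up for illustration; only glob.iglob with recursive=True comes from the code above.

```python
import glob
import os
import tempfile

# Hypothetical file layout, created in a temporary directory for the demo.
sample_files = [
    "a.pdf",
    "notes.txt",
    os.path.join("sub", "b.pdf"),
    os.path.join("sub", "deeper", "c.pdf"),
]

with tempfile.TemporaryDirectory() as tmp:
    for rel in sample_files:
        full = os.path.join(tmp, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "w").close()  # create an empty placeholder file

    # "**" matches zero or more directory levels when recursive=True,
    # so top-level PDFs are included along with nested ones.
    pattern = os.path.join(tmp, "**", "*.pdf")
    found = sorted(
        os.path.relpath(p, tmp) for p in glob.iglob(pattern, recursive=True)
    )

print(found)
```

Note that notes.txt is skipped because the pattern ends in *.pdf. On case-sensitive file systems, a file named report.PDF would not match this pattern either.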
To determine the filepath for the new .txt file, use functions from the os.path module.
os.path.relpath(filepath, pdf_dir) returns the filepath of the file, including any subdirectories, relative to pdf_dir.
Suppose filepath is:
    C:/Users/Documents/pdf/Setec Astronomy/employees.pdf
and pdf_dir is:
    C:/Users/Documents/pdf/
It would return Setec Astronomy/employees.pdf, which can then be passed into os.path.join() along with txt_dir, giving us the complete filepath with the extra subdirectories included.
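The path mapping described above can be isolated into a small helper, which makes it easy to check on its own. This is only a sketch: txt_path_for is a hypothetical name, and the example uses short relative directory names rather than the real folders.

```python
import os

def txt_path_for(pdf_path, pdf_dir, txt_dir):
    """Map a PDF path under pdf_dir to the matching .txt path under txt_dir."""
    root, _ = os.path.splitext(pdf_path)            # drop the ".pdf" extension
    relative_root = os.path.relpath(root, pdf_dir)  # e.g. "Setec Astronomy/employees"
    return os.path.join(txt_dir, relative_root + ".txt")

# Hypothetical example paths, built with os.path.join so separators match the OS.
example = txt_path_for(
    os.path.join("pdf", "Setec Astronomy", "employees.pdf"), "pdf", "txt"
)
print(example)
```

Because os.path.relpath normalizes both arguments first, a trailing slash on pdf_dir does not change the result.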
You could do txt_filepath = filepath.replace(pdf_dir, txt_dir), but you'd have to make sure all the corresponding slashes point in the same direction and that there are no extra or missing leading/trailing slashes.
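The slash-direction caveat is easy to reproduce. In the sketch below, the file path uses backslashes while the directories use forward slashes, so swapping the pdf_dir prefix for txt_dir with str.replace silently does nothing. The standard library's ntpath module (the Windows flavour of os.path, importable on any platform) is used here only so the demo behaves the same everywhere; on Windows, os.path is ntpath.

```python
import ntpath

pdf_dir = "C:/Users/Documents/pdf/"
txt_dir = "C:/Users/Documents/txt/"
# Same file as in the example above, but written with backslashes.
filepath = "C:\\Users\\Documents\\pdf\\Setec Astronomy\\employees.pdf"

# The prefix never matches character for character, so nothing is replaced.
replaced = filepath.replace(pdf_dir, txt_dir)
print(replaced == filepath)

# relpath normalizes separators before comparing, so it still works.
relative = ntpath.relpath(filepath, pdf_dir)
txt_filepath = ntpath.normpath(ntpath.join(txt_dir, relative))
print(txt_filepath)
```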
Before opening the new .txt file, any and all missing subdirectories need to be created. os.path.dirname() is called to get the path of the file's directory, and os.makedirs() is called with its exist_ok argument set to True to suppress the FileExistsError exception if the directory already exists.
A with statement is used when opening the .txt file to avoid explicitly calling .close(), especially in case of any exceptions.
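The directory creation and the with statement can be tried together in a temporary folder. This is a minimal sketch: write_text is a hypothetical helper name, and the explicit utf-8 encoding is an assumption added here, not something the code above specifies.

```python
import os
import tempfile

def write_text(txt_filepath, text):
    """Create any missing parent directories, then write text to the file."""
    parent = os.path.dirname(txt_filepath)
    if parent:  # dirname is "" for a bare filename; makedirs("") would fail
        os.makedirs(parent, exist_ok=True)
    # The with block closes the file even if write() raises.
    with open(txt_filepath, "w", encoding="utf-8") as f:
        f.write(text)

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "Setec Astronomy", "employees.txt")
    write_text(target, "first")
    write_text(target, "second")  # directory already exists; no FileExistsError
    with open(target, encoding="utf-8") as f:
        result = f.read()

print(result)
```

The second call shows the effect of exist_ok=True: the directory is already there, and the file is simply overwritten.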
Upvotes: 1