Reputation: 357
I am trying to convert many PDF files to text. The PDFs are organized in subdirectories within a directory, so there are three layers: directory --> subdirectories --> multiple PDF files in each subdirectory. I am using the following code, which raises this error: ValueError: too many values to unpack (expected 3). The code works when I convert files in a single directory, but not across multiple subdirectories.
It might be quite simple, but I cannot get my head around it. Any help would be much appreciated. Thanks.
import pytesseract
from pdf2image import convert_from_path
import glob
pdfs = glob.glob(r"K:\pdf_files")
for pdf_path, dirs, files in pdfs:
    for file in files:
        convert_from_path(os.path.join(pdf_path, file), 500)
        for pageNum,imgBlob in enumerate(pages):
            text = pytesseract.image_to_string(imgBlob,lang='eng')
            with open(f'{pdf_path}.txt', 'a') as the_file:
                the_file.write(text)
Upvotes: 4
Views: 9321
Reputation: 1053
import os
import pytesseract
from pdf2image import convert_from_path

# Path to the folder containing PDF files
input_folder = "d:/doc/doc"
# Path to the folder where text files will be saved
output_folder = "d:/doc/doc"
# Path to the Tesseract OCR executable (change if necessary)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Get a list of all PDF files in the input folder.
# Note: os.listdir is not recursive, so PDFs inside subfolders are not picked up.
files = [f for f in os.listdir(input_folder) if f.lower().endswith(".pdf")]

# Loop through each PDF file and convert it to text using OCR
for file in files:
    pdf_path = os.path.join(input_folder, file)
    txt_path = os.path.join(output_folder, os.path.splitext(file)[0] + ".txt")
    # Convert PDF pages to images
    images = convert_from_path(pdf_path)
    # Perform OCR on images and extract text
    text = ""
    for image in images:
        # text += pytesseract.image_to_string(image)
        text += pytesseract.image_to_string(image, lang='ron')  # your document language
    # Save the extracted text to a text file
    with open(txt_path, "w", encoding="utf-8") as txt_file:
        txt_file.write(text)

print("Conversion complete!")
Upvotes: 1
Reputation: 15961
As mentioned in the comments, you need os.walk, not glob.glob. os.walk lists the directory tree recursively and yields one tuple per folder: pdf_path is the folder it is currently listing, dirs is the list of subdirectories in that folder, and files is the list of files in that folder.
Use os.path.join() to form a full path from the parent folder and the filename.
Also, instead of constantly appending to the txt file, just open it once, outside the page-to-text loop.
import os
import pytesseract
from pdf2image import convert_from_path

pdfs_dir = r"K:\pdf_files"
for pdf_path, dirs, files in os.walk(pdfs_dir):
    for file in files:
        if not file.lower().endswith('.pdf'):
            # skip non-PDFs
            continue
        file_path = os.path.join(pdf_path, file)
        pages = convert_from_path(file_path, 500)
        # swap the extension for .txt; splitext also handles names like
        # REPORT.PDF, which a case-sensitive replace(".pdf", ".txt") would miss
        txt_path = os.path.splitext(file_path)[0] + '.txt'
        with open(txt_path, 'w') as the_file:  # write mode, since the file is opened once
            for pageNum, imgBlob in enumerate(pages):
                text = pytesseract.image_to_string(imgBlob, lang='eng')
                the_file.write(text)
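To see why the original loop fails and what os.walk gives you instead, here is a minimal sketch that runs without Tesseract or pdf2image. It builds a throwaway folder tree with tempfile; the names sub_a, sub_b, one.pdf and two.pdf are made up for illustration and are not from the question.

```python
import glob
import os
import tempfile

# Build a temporary tree: root/sub_a/one.pdf and root/sub_b/two.pdf
# (hypothetical names, used only for this demonstration).
root = tempfile.mkdtemp()
for sub, name in [("sub_a", "one.pdf"), ("sub_b", "two.pdf")]:
    os.makedirs(os.path.join(root, sub))
    open(os.path.join(root, sub, name), "w").close()

# glob.glob returns a flat list of path strings. With no wildcard in the
# pattern, the list holds just the directory path itself.
matches = glob.glob(root)

# Unpacking one string into three names iterates over its characters,
# so any path longer than three characters raises ValueError.
try:
    for pdf_path, dirs, files in matches:
        pass
    unpack_error = None
except ValueError as exc:
    unpack_error = str(exc)

# os.walk yields a (dirpath, dirnames, filenames) tuple for every folder
# in the tree, which is the shape the three-name loop expects.
found = sorted(
    os.path.join(dirpath, name)
    for dirpath, _dirnames, filenames in os.walk(root)
    for name in filenames
    if name.lower().endswith(".pdf")
)

print(unpack_error)
print(found)
```

The first print shows the same message as in the question, and the second lists both PDFs with their full paths, ready to pass to convert_from_path.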
Upvotes: 2
Reputation: 357
I have just solved the problem in a simpler way by adding * to the glob pattern so that it matches all subdirectories in the directory:
import pytesseract
from pdf2image import convert_from_path
import glob
pdfs = glob.glob(r"K:\pdf_files\*\*.pdf")
for pdf_path in pdfs:
    pages = convert_from_path(pdf_path, 500)
    for pageNum,imgBlob in enumerate(pages):
        text = pytesseract.image_to_string(imgBlob,lang='eng')
        with open(f'{pdf_path}.txt', 'a') as the_file:
            the_file.write(text)
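One caveat: a pattern with a single * for the folder level only matches PDFs exactly one subdirectory deep. If the nesting can vary, a possible variation is glob's ** wildcard with recursive=True, sketched below without the OCR step. The helpers find_pdfs and txt_path_for are hypothetical names introduced here, and the demo tree is created with tempfile purely for illustration.

```python
import glob
import os
import tempfile


def find_pdfs(root):
    """Return every *.pdf under root, at any depth, as a sorted list."""
    # With recursive=True, "**" matches zero or more directory levels.
    pattern = os.path.join(root, "**", "*.pdf")
    return sorted(glob.glob(pattern, recursive=True))


def txt_path_for(pdf_path):
    """Swap the extension, so report.pdf maps to report.txt."""
    return os.path.splitext(pdf_path)[0] + ".txt"


# Demo tree with PDFs at three different depths plus one non-PDF file.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "a", "deeper"))
for rel in [
    "top.pdf",
    os.path.join("a", "mid.pdf"),
    os.path.join("a", "deeper", "low.pdf"),
    os.path.join("a", "notes.txt"),
]:
    open(os.path.join(root, rel), "w").close()

pdfs = find_pdfs(root)
for pdf_path in pdfs:
    print(pdf_path, "->", txt_path_for(pdf_path))
```

In the real script, each pdf_path would go to convert_from_path and the text would be written to txt_path_for(pdf_path), opened once in 'w' mode. Note that on case-sensitive filesystems the *.pdf pattern does not match names ending in .PDF.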
Upvotes: 4