crackers

Reputation: 357

Convert PDF to text using Python and pytesseract

I am trying to convert many PDF files to text. The PDFs are organized in subdirectories within one directory, so there are three layers: directory --> subdirectories --> multiple PDF files in each subdirectory. The code below raises ValueError: too many values to unpack (expected 3). It works when I convert files in a single directory, but not across multiple subdirectories.

It might be quite simple but I cannot get my head around it. Any help would be much appreciated. Thanks.

import pytesseract
from pdf2image import convert_from_path
import glob

pdfs = glob.glob(r"K:\pdf_files")

for pdf_path, dirs, files in pdfs:
    for file in files:
        pages = convert_from_path(os.path.join(pdf_path, file), 500)

        for pageNum,imgBlob in enumerate(pages):
            text = pytesseract.image_to_string(imgBlob,lang='eng')

            with open(f'{pdf_path}.txt', 'a') as the_file:
                the_file.write(text)

Upvotes: 4

Views: 9321

Answers (3)

Just Me

Reputation: 1053

import os
import pytesseract
from pdf2image import convert_from_path

# Path to the folder containing PDF files
input_folder = "d:/doc/doc"

# Path to the folder where text files will be saved
output_folder = "d:/doc/doc"

# Path to the Tesseract OCR executable (change if necessary)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Get a list of all PDF files in the input folder
files = [f for f in os.listdir(input_folder) if f.endswith(".pdf")]

# Loop through each PDF file and convert it to text using OCR
for file in files:
    pdf_path = os.path.join(input_folder, file)
    txt_path = os.path.join(output_folder, os.path.splitext(file)[0] + ".txt")

    # Convert PDF pages to images
    images = convert_from_path(pdf_path)

    # Perform OCR on images and extract text
    text = ""
    for image in images:
        # text += pytesseract.image_to_string(image)
        text += pytesseract.image_to_string(image, lang='ron') # your document language

    # Save the extracted text to a text file
    with open(txt_path, "w", encoding="utf-8") as txt_file:
        txt_file.write(text)

print("Conversion complete!")

Upvotes: 1

aneroid

Reputation: 15961

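As mentioned in the comments, you need os.walk, not glob.glob. glob.glob returns a flat list of path strings, so unpacking each string into three names is what raises the ValueError. os.walk traverses the directory tree recursively and yields a (pdf_path, dirs, files) tuple for each folder: pdf_path is the folder it is currently listing, dirs is the list of subdirectories in it, and files is the list of files in that folder.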

Use os.path.join() to form a full path using the parent folder and the filename.

Also, instead of reopening the txt file in append mode for every page, open it once outside the 'page-to-text' loop.

import os

import pytesseract
from pdf2image import convert_from_path

pdfs_dir = r"K:\pdf_files"

for pdf_path, dirs, files in os.walk(pdfs_dir):
    for file in files:
        if not file.lower().endswith('.pdf'):
            # skip non-pdf's
            continue
        
        file_path = os.path.join(pdf_path, file)
        pages = convert_from_path(file_path, 500)
        
        # change the file extension to .txt; splitext also handles
        # upper-case extensions such as .PDF, which str.replace would miss
        with open(os.path.splitext(file_path)[0] + '.txt', 'w') as the_file:  # write mode, opened once per PDF
            for pageNum, imgBlob in enumerate(pages):
                text = pytesseract.image_to_string(imgBlob,lang='eng')
                the_file.write(text)
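
To see why the original loop fails and what os.walk returns instead, here is a minimal sketch that needs neither Tesseract nor real PDFs. The helper name find_pdfs and the temporary directory layout are made up for illustration; only glob.glob and os.walk come from the question and this answer.

```python
import glob
import os
import tempfile


def find_pdfs(root):
    """Hypothetical helper: full paths of every .pdf under root, at any depth."""
    found = []
    for parent, _dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith(".pdf"):
                found.append(os.path.join(parent, name))
    return sorted(found)


# Throwaway tree: root/a/one.pdf, root/b/two.PDF, root/b/notes.txt
root = tempfile.mkdtemp()
for sub, name in [("a", "one.pdf"), ("b", "two.PDF"), ("b", "notes.txt")]:
    os.makedirs(os.path.join(root, sub), exist_ok=True)
    open(os.path.join(root, sub, name), "w").close()

# glob.glob on a plain directory path returns a list holding that one string.
matches = glob.glob(root)

# Unpacking a string into three names splits it character by character,
# which is what produces the ValueError from the question.
try:
    for pdf_path, dirs, files in matches:
        pass
    unpack_error = None
except ValueError as exc:
    unpack_error = str(exc)

# os.walk yields one (folder, subfolders, files) tuple per directory.
pdf_files = find_pdfs(root)
```

Here unpack_error holds the "too many values to unpack" message, while pdf_files contains both PDFs and skips notes.txt.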

Upvotes: 2

crackers

Reputation: 357

I have just solved the problem in a simpler way by adding * to the glob pattern so that it matches every subdirectory of the directory:

import pytesseract
from pdf2image import convert_from_path
import glob

pdfs = glob.glob(r"K:\pdf_files\*\*.pdf")

for pdf_path in pdfs:
    pages = convert_from_path(pdf_path, 500)

    for pageNum,imgBlob in enumerate(pages):
        text = pytesseract.image_to_string(imgBlob,lang='eng')

        with open(f'{pdf_path}.txt', 'a') as the_file:
            the_file.write(text)
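
Note that the \*\*.pdf pattern above matches PDFs exactly one level down. If the files may sit at other depths, glob's recursive mode might be an option. The following is a rough sketch under that assumption; find_pdfs_recursive and txt_path_for are hypothetical helper names, and the temporary tree only stands in for K:\pdf_files.

```python
import glob
import os
import tempfile


def find_pdfs_recursive(root):
    """Return sorted paths of *.pdf files under root at any depth."""
    # With recursive=True, "**" matches zero or more nested directories.
    pattern = os.path.join(root, "**", "*.pdf")
    return sorted(glob.glob(pattern, recursive=True))


def txt_path_for(pdf_path):
    """Hypothetical helper: report.pdf -> report.txt (not report.pdf.txt)."""
    return os.path.splitext(pdf_path)[0] + ".txt"


# Throwaway tree with PDFs at three different depths.
root = tempfile.mkdtemp()
for parts in [("top.pdf",), ("a", "one.pdf"), ("a", "deep", "two.pdf")]:
    path = os.path.join(root, *parts)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()

# The one-level pattern from the answer finds only a/one.pdf.
one_level = glob.glob(os.path.join(root, "*", "*.pdf"))

# The recursive pattern finds top.pdf, a/one.pdf and a/deep/two.pdf.
all_levels = find_pdfs_recursive(root)
```

On case-sensitive file systems the *.pdf pattern will not match an upper-case .PDF extension, so the os.walk approach with lower() may be the safer choice for mixed-case names.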

Upvotes: 4
