Ryan Adams

Reputation: 77

How do I extract all of the text from a PDF using indexing?

I am new to Python and coding in general. I'm trying to create a program that will OCR a directory of PDFs and then extract the text so I can later pick out specific things. However, I am having trouble getting pdfplumber to extract the text from all of the pages. You can index from a start page to an end page, but if the end is unknown, it breaks because the index is out of range.

import ocrmypdf
import os
import requests
import pdfplumber
import re
import logging
import sys
import PyPDF2

## test folder C:\Users\adams\OneDrive\Desktop\PDF

user_direc = input("Enter the path of your files: ") 

#walks the path and prints out each PDF in the 
#OCRs the documents and skips any OCR'd pages.


for dir_name, subdirs, file_list in os.walk(user_direc):
    logging.info(dir_name + '\n')
    os.chdir(dir_name)
    for filename in file_list:
        file_ext = os.path.splitext(filename)[0--1]
        if file_ext == '.pdf':
            full_path = dir_name + '/' + filename
            print(full_path)
            # OCR each PDF in place; skip_text leaves pages that already have text untouched
            result = ocrmypdf.ocr(filename, filename, skip_text=True, deskew=True, optimize=1)
            logging.info(result)

#the next step is to extract the text from each individual document and print

directory = os.fsencode(user_direc)
    
for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith('.pdf'):
        with pdfplumber.open(file) as pdf:
            page = pdf.pages[0]
            text = page.extract_text()
            print(text)  

As is, this only takes the text from the first page of each PDF. I want to extract all of the text from each PDF, but pdfplumber breaks if my index is too large, and I do not know how many pages each PDF will have. I've tried

page = pdf.pages[0--1]

but this breaks as well. I have not been able to find a workaround with PyPDF2, either. I apologize if this code is sloppy or unreadable. I've tried to add comments to explain what I am doing.

Upvotes: 7

Views: 23934

Answers (2)

mpriya

Reputation: 893

If you encounter this error when you run the code above:

fp = open(path_or_fp, "rb") FileNotFoundError: [Errno 2] No such file or directory:

This happens because os.listdir() returns only file names, relative to the directory being listed. You have to join each name with the directory to rebuild the full path before opening the file.

To resolve the error, try the code below:

import os
import pdfplumber

directory = r'C:\Users\foo\folder'

for filename in os.listdir(directory):
    if filename.endswith('.pdf'):
        fullpath = os.path.join(directory, filename)
        #print(fullpath)
        all_text = ""
        with pdfplumber.open(fullpath) as pdf:
            for page in pdf.pages:
                # extract_text() can come back as None or '' for a page with no text
                text = page.extract_text() or ''
                #print(text)
                all_text += '\n' + text
        print(all_text)
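
As a rough sketch of the path-joining step on its own (the helper name list_pdf_paths is made up for this example, and the demo uses empty placeholder files in a temporary folder rather than real PDFs):

```python
import os
import tempfile


def list_pdf_paths(directory):
    """Return sorted full paths of the PDF files directly inside directory."""
    paths = []
    for name in os.listdir(directory):
        # os.listdir() yields bare names, so rebuild the full path first
        full_path = os.path.join(directory, name)
        # lower() also catches extensions such as .PDF
        if name.lower().endswith(".pdf") and os.path.isfile(full_path):
            paths.append(full_path)
    return sorted(paths)


# Demo with a throwaway directory and empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "B.PDF", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(p) for p in list_pdf_paths(tmp)]
```

Each path returned this way can be passed straight to pdfplumber.open(), which avoids the FileNotFoundError regardless of the current working directory.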

Reference: Extract text from pdf file using pdfplumber

Upvotes: 0

mark_s

Reputation: 496

The pdfplumber GitHub page says pdfplumber.open returns an instance of the pdfplumber.PDF class.

That instance has a pages property, which is a list of pdfplumber.Page instances, one per page loaded from your PDF. Looking at your code, if you do:

total_pages = len(pdf.pages)

You should get the total pages for the currently loaded pdf.
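
Since pages is an ordinary list, normal list indexing applies. Here is a small illustration that uses a plain list of strings as a stand-in for pdf.pages (the sample data is made up). It also shows why pages[0--1] does not mean "first to last": Python reads 0--1 as 0 - (-1), which is just the index 1.

```python
# Stand-in for pdf.pages: a plain list of three "pages" (hypothetical data).
pages = ["page one", "page two", "page three"]

second_index = 0--1                 # parsed as 0 - (-1), i.e. the single index 1
total_pages = len(pages)            # number of pages in the list
last_page = pages[-1]               # a negative index counts from the end
every_page = pages[0:total_pages]   # a slice covering all pages, same as pages[:]
```

So pages[0--1] asks for the second page only, and it raises IndexError on a one-page PDF.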

To combine all of a PDF's text into one large string, you can loop over the pages with for ... in. Try changing your existing code:

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith('.pdf'):
        with pdfplumber.open(file) as pdf:
            page = pdf.pages[0]
            text = page.extract_text()
            print(text)  

To:

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith('.pdf'):
        all_text = ''  # new line
        # os.listdir() returns bare names, so join with the directory before opening
        with pdfplumber.open(os.path.join(os.fsdecode(directory), filename)) as pdf:
            # page = pdf.pages[0] - comment out or remove line
            # text = page.extract_text() - comment out or remove line
            for pdf_page in pdf.pages:
                # extract_text() can come back as None or '' for a page with no text
                single_page_text = pdf_page.extract_text() or ''
                print(single_page_text)
                # separate each page's text with newline
                all_text = all_text + '\n' + single_page_text
            print(all_text)
            # print(text) - comment out or remove line

Rather than using an index such as pdf.pages[0] to access individual pages, use for pdf_page in pdf.pages. The loop stops after the last page without raising an exception, so you don't have to worry about an index that's out of range.
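
If you want the looping logic in one reusable place, something like the following could work. This is only a sketch: join_page_text and FakePage are names invented for this example, and FakePage merely imitates the extract_text() method of a pdfplumber page so the snippet runs without a real PDF.

```python
def join_page_text(pages, separator="\n"):
    """Concatenate the text of every page, skipping pages with no text."""
    parts = []
    for page in pages:
        text = page.extract_text()
        if text:  # None or "" means nothing was extracted from this page
            parts.append(text)
    return separator.join(parts)


class FakePage:
    """Hypothetical stand-in for a pdfplumber page, used only for this demo."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


demo_pages = [FakePage("first"), FakePage(None), FakePage("third")]
combined = join_page_text(demo_pages)
```

With a real file you would call it inside the with block, for example all_text = join_page_text(pdf.pages). Because it iterates instead of indexing, it works the same for a one-page PDF and a hundred-page one.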

Upvotes: 19
