Rohan Amrute
Rohan Amrute

Reputation: 774

Read pdf page by page

I searched for my question and did not get my answer in the two available questions

  1. Extract text per page with Python pdfMiner?

  2. PDFMiner - Iterating through pages and converting them to text

Basically I want to iterate over each page because I want to select only that page which has a certain text.

I have used pyPdf. It works for almost i can say 90% of the pdfs but sometimes it does not extract the information from a page.

I have used the below code:

import pyPdf
extract = ""        
pdf = pyPdf.PdfFileReader(open('filename.pdf', "rb"))
num_of_pages = pdf.getNumPages()
for p in range(num_of_pages):
  ex = pdf.getPage(6)
  ex = ex.extractText()
  if re.search(r"to be held (at|on)",ex.lower()):
    print 'yes'
    print  ex ,"\n"
    extract = extract + ex + "\n" 
    continue

The above code works but sometimes some pages don't get extracted.

I also tried using pdfminer, but i could not find how to iterate the pdf in it page by page. pdfminer returns the entire text of the pdf.

I used the below code:

def convert_pdf_to_txt(path):
  rsrcmgr = PDFResourceManager()
  retstr = StringIO()
  codec = 'utf-8'
  laparams = LAParams()
  device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
  fp = file(path, 'rb')
  interpreter = PDFPageInterpreter(rsrcmgr, device)
  password = ""
  maxpages = 0
  caching = True
  pagenos=set()

 for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)

    text = retstr.getvalue()

   fp.close()
   device.close()
   retstr.close()
   return text

In the above code the text from the pdf comes from the for loop

for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)

    text = retstr.getvalue()

In this how can I iterated on one page at a time.

The documentation on pdfminer is not understandable. Also there are many versions of the same.

So are there any other packages available for my question or can pdfminer be used for it?

Upvotes: 2

Views: 16518

Answers (3)

Nisarg Sheth
Nisarg Sheth

Reputation: 1

You can refer the following link to extract page by page text from PDF.

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
for page_layout in extract_pages("test.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())

PDFMiner Page by Page text Extraction

Upvotes: 0

fiftyseventheory
fiftyseventheory

Reputation: 61

Because retstr will retain each page, you might consider altering your code by calling retstr.truncate(0) which clears the string each time, otherwise you're printing the entirety of what's already been read each time:

import pyPdf
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

path = "filename.pdf"
pdf = pyPdf.PdfFileReader(open(path, "rb"))
fp = file(path, 'rb')
num_of_pages = pdf.getNumPages()
extract = ""
for i in range(num_of_pages):
  inside = [i]
  pagenos=set(inside)
  rsrcmgr = PDFResourceManager()
  retstr = StringIO()
  codec = 'utf-8'
  laparams = LAParams()
  device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
  interpreter = PDFPageInterpreter(rsrcmgr, device)
  password = ""
  maxpages = 0
  caching = True
  text = ""
  for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)
    text = retstr.getvalue()
    retstr.truncate(0)
    text = text.decode("ascii","replace")
    if re.search(r"to be held (at|on)",text.lower()):
        print text
        extract = extract + text + "\n" 
        continue

Upvotes: 6

Rohan Amrute
Rohan Amrute

Reputation: 774

I know it is not good to answer your own question but i think i may have figured out an answer for this question.

I think it is not the best way to do it, but still it helps me.

I used a combination of pypdf and pdfminer

The code is as below:

import pyPdf
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

path = "filename.pdf"
pdf = pyPdf.PdfFileReader(open(path, "rb"))
fp = file(path, 'rb')
num_of_pages = pdf.getNumPages()
extract = ""
for i in range(num_of_pages):
  inside = [i]
  pagenos=set(inside)
  rsrcmgr = PDFResourceManager()
  retstr = StringIO()
  codec = 'utf-8'
  laparams = LAParams()
  device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
  interpreter = PDFPageInterpreter(rsrcmgr, device)
  password = ""
  maxpages = 0
  caching = True
  text = ""
  for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)
    text = retstr.getvalue()
    text = text.decode("ascii","replace")
    if re.search(r"to be held (at|on)",text.lower()):
        print text
        extract = extract + text + "\n" 
        continue

There may be a better way to do it, but currently i found out this to be pretty good.

Upvotes: 3

Related Questions