Reputation: 774
I searched for this and did not find an answer in the two existing questions on the topic.
Basically, I want to iterate over a PDF page by page, because I want to select only the pages that contain a certain text.
I have used pyPdf. It works for about 90% of my PDFs, but sometimes it fails to extract the text from a page.
I used the code below:
import re
import pyPdf

extract = ""
pdf = pyPdf.PdfFileReader(open('filename.pdf', "rb"))
num_of_pages = pdf.getNumPages()
for p in range(num_of_pages):
    # Fetch the current page by its zero-based index
    ex = pdf.getPage(p)
    ex = ex.extractText()
    # Keep only the pages that contain the phrase
    if re.search(r"to be held (at|on)", ex.lower()):
        print 'yes'
        print ex, "\n"
        extract = extract + ex + "\n"
The above code works, but for some pages no text is extracted.
I also tried pdfminer, but I could not find out how to iterate over the PDF page by page with it; it returns the text of the entire PDF.
I used the code below:
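from cStringIO import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()  # empty set: process every page
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    # retstr now holds the text of all pages processed so far
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text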
In the code above, the text of the PDF is produced by the for loop:
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
How can I process one page at a time here?
The pdfminer documentation is hard to follow, and there are several versions of the package.
So can pdfminer be used for this, or are there other packages that can do it?
Upvotes: 2
Views: 16518
Reputation: 1
You can refer to the following link to extract text from a PDF page by page.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

for page_layout in extract_pages("test.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())
PDFMiner Page by Page text Extraction
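To tie this back to the question's goal of keeping only the pages that contain a phrase, the loop above could be wrapped as follows. This is a minimal sketch, assuming pdfminer.six under Python 3; the names `iter_page_texts`, `matching_pages` and `PHRASE` are illustrative and not part of pdfminer. The pdfminer import sits inside the function so that the filtering helper can be used on any list of page strings.

```python
import re

# Phrase from the question; IGNORECASE replaces the manual .lower() call.
PHRASE = re.compile(r"to be held (at|on)", re.IGNORECASE)


def iter_page_texts(path):
    """Yield the text of each page of the PDF at `path`, one string per page."""
    # Imported lazily so the filtering helper below works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_layout in extract_pages(path):
        # Join the text of every text container found on this page.
        yield "".join(
            element.get_text()
            for element in page_layout
            if isinstance(element, LTTextContainer)
        )


def matching_pages(page_texts, pattern=PHRASE):
    """Return (zero-based page index, text) for every page matching `pattern`."""
    return [
        (index, text)
        for index, text in enumerate(page_texts)
        if pattern.search(text)
    ]
```

A call such as `matching_pages(iter_page_texts("test.pdf"))` would then give the index and text of each page that mentions the phrase.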
Upvotes: 0
Reputation: 61
Because retstr retains the text of every page processed so far, consider calling retstr.truncate(0) after reading each page. This clears the buffer; otherwise each print outputs everything that has been read up to that point:
import re
import pyPdf
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

path = "filename.pdf"
# pyPdf is used only to count the pages
pdf = pyPdf.PdfFileReader(open(path, "rb"))
fp = open(path, 'rb')
num_of_pages = pdf.getNumPages()
extract = ""
for i in range(num_of_pages):
    # Restrict pdfminer to the single zero-based page index i
    inside = [i]
    pagenos = set(inside)
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    text = ""
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
        text = retstr.getvalue()
        # Clear the buffer so the next read does not repeat this page
        retstr.truncate(0)
        text = text.decode("ascii", "replace")
        if re.search(r"to be held (at|on)", text.lower()):
            print text
            extract = extract + text + "\n"
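For readers on Python 3, the same buffer-reuse idea could be sketched as below. This is a rough sketch, assuming pdfminer.six; `take_and_reset` and `iter_page_texts` are hypothetical helper names. One difference from the Python 2 code above: `io.StringIO.truncate` does not move the stream position, so the sketch calls `seek(0)` before truncating.

```python
from io import StringIO


def take_and_reset(buffer):
    """Return everything written to `buffer` so far, then empty it."""
    text = buffer.getvalue()
    buffer.seek(0)      # truncate() leaves the stream position unchanged
    buffer.truncate(0)  # drop the old content so the next page starts clean
    return text


def iter_page_texts(path):
    """Yield the text of each page, reusing one converter and one buffer."""
    # Imported lazily so take_and_reset can be tried without pdfminer.six.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    buffer = StringIO()
    device = TextConverter(rsrcmgr, buffer, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    try:
        with open(path, "rb") as fp:
            for page in PDFPage.get_pages(fp):
                interpreter.process_page(page)
                # The buffer holds only this page because it is reset each time.
                yield take_and_reset(buffer)
    finally:
        device.close()
```

Since this version reads the file in a single pass, it does not need pyPdf to count the pages first.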
Upvotes: 6
Reputation: 774
I know it is unusual to answer your own question, but I think I have figured out an answer.
It is probably not the best way to do it, but it works for me.
I used a combination of pyPdf and pdfminer.
The code is below:
import re
import pyPdf
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

path = "filename.pdf"
# pyPdf is used only to count the pages
pdf = pyPdf.PdfFileReader(open(path, "rb"))
fp = open(path, 'rb')
num_of_pages = pdf.getNumPages()
extract = ""
for i in range(num_of_pages):
    # Restrict pdfminer to the single zero-based page index i
    inside = [i]
    pagenos = set(inside)
    rsrcmgr = PDFResourceManager()
    # A fresh buffer per page, so it only ever holds one page of text
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    text = ""
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
        text = retstr.getvalue()
        text = text.decode("ascii", "replace")
        if re.search(r"to be held (at|on)", text.lower()):
            print text
            extract = extract + text + "\n"
There may be a better way to do it, but so far this has worked well for me.
Upvotes: 3