Reputation: 11
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_pdf(pdf_path):
    with open(pdf_path, 'rb') as fh:
        # iterate over all pages of the PDF document
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            # create a resource manager
            resource_manager = PDFResourceManager()
            # create an in-memory text buffer to receive the output
            fake_file_handle = StringIO()
            # create a text converter object
            converter = TextConverter(
                resource_manager,
                fake_file_handle,
                codec='utf-8',
                laparams=LAParams()
            )
            # create a page interpreter
            page_interpreter = PDFPageInterpreter(
                resource_manager,
                converter
            )
            # process the current page
            page_interpreter.process_page(page)
            # extract the text of this page
            text = fake_file_handle.getvalue()
            yield text
            # close open handles (runs when the generator resumes)
            converter.close()
            fake_file_handle.close()


text = ''
for page in extract_pdf('Path of the PDF Document'):
    text += page
With this code I was able to extract text from many PDF documents. But when I tested it on other random PDFs from the internet, the results became inconsistent, and for some files no text comes out at all. When I checked the type of the extracted text, it was still <class 'str'>.
Can someone point out any errors I overlooked while writing this code?
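To narrow the problem down, a small check like the one below could show which pages come back blank. This is only a sketch: `empty_pages` is a hypothetical helper of my own, not part of pdfminer, and it works on any iterable of page strings. It treats whitespace-only output as blank on the assumption that the text converter may emit nothing but a form feed for a page without extractable text.

```python
def empty_pages(page_texts):
    """Return the 1-based numbers of pages whose extracted text is blank."""
    blank = []
    for number, text in enumerate(page_texts, start=1):
        # Whitespace alone (including a lone form feed) counts as "no text".
        if not text.strip():
            blank.append(number)
    return blank


# Stand-in strings for illustration; in practice the generator from
# extract_pdf('some.pdf') would be passed in instead.
sample = ["First page text\n", "\x0c", "", "Last page\n"]
print(empty_pages(sample))  # prints [2, 3]
```

If every page of a failing document shows up in the result, the file may contain only scanned images with no text layer, which I have not confirmed for my test files.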
Upvotes: 1
Views: 415
Reputation: 495
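import PyPDF2

# Legacy PyPDF2 API (before 3.0); newer releases rename these to
# PdfReader, len(reader.pages), reader.pages[i] and extract_text().
with open('example.pdf', 'rb') as o:
    r = PyPDF2.PdfFileReader(o)
    for page in range(r.numPages):
        obj = r.getPage(page)
        print(obj.extractText())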
Upvotes: 1