Reputation: 242
I have been trying to extract text from a PDF file that contains Hindi (Devanagari) text and store the result in a text file.
Can you help me with extracting Hindi text from the PDF using PyPDF2 instead of pdfminer and other tools?
This is my current code, which does not work:
import PyPDF2 as ppdf
import codecs
pdfobj = open('hindi.pdf', mode='rb')
pdfread = ppdf.PdfFileReader(pdfobj)
page = pdfread.getPage(1)
text = page.extractText().encode('utf-8')
print(text)
Instead of Hindi text, it prints junk values like this:
204 0,*L !*+,-./, 0(1,#.23)#*+ ,#- @'#7<1593=? @'#7< :2
Upvotes: 2
Views: 805
Reputation: 136307
Use a recent version of pypdf:
from pypdf import PdfReader
reader = PdfReader("example.pdf")
page = reader.pages[0]
print(page.extract_text())
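Since the question also asks about saving the result to a text file, here is a minimal sketch of one way to do that. The function names `write_pages` and `save_pdf_text` and the file names are placeholders of my own, not part of pypdf. It assumes pypdf is installed, and whether the Devanagari text comes out correctly may still depend on how the fonts are embedded in the PDF.

```python
from pathlib import Path


def write_pages(pages, txt_path):
    """Write the text of each page to a UTF-8 file and return the page count."""
    count = 0
    # Open in text mode with an explicit encoding; no manual .encode() is needed.
    with Path(txt_path).open("w", encoding="utf-8") as out:
        for page in pages:
            # Fall back to an empty string if a page yields no text.
            out.write(page.extract_text() or "")
            out.write("\n")
            count += 1
    return count


def save_pdf_text(pdf_path, txt_path):
    """Extract every page of pdf_path into txt_path (hypothetical helper)."""
    # Imported here so write_pages can be used on its own.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return write_pages(reader.pages, txt_path)
```

It could be called as save_pdf_text("hindi.pdf", "hindi.txt"). Writing the str returned by extract_text() straight to a file opened with encoding="utf-8" avoids the bytes object that .encode('utf-8') produces in the question's code.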
Upvotes: 1