Reputation: 34160
I posted a question earlier about the pyPdf Python tool. Please don't mark this as a duplicate, because I now get the errors shown below.
import sys
import pyPdf

def convertPdf2String(path):
    content = ""
    # load PDF file
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # iterate pages
    for i in range(0, pdf.getNumPages()):
        # extract the text from each page
        content += pdf.getPage(i).extractText() + " \n"
    # collapse whitespace
    content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
    return content

# convert contents of a PDF file and store result to TXT file
f = open('a.txt','w+')
f.write(convertPdf2String(sys.argv[1]))
f.close()

# or print contents to the standard out stream
print convertPdf2String("/home/tom/Desktop/Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")
I get this error for the first PDF file:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
and the following error for this PDF: http://www.envis-icpe.com/pointcounterpointbook/Hindi_Book.pdf
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 38: ordinal not in range(128)
How can I resolve this?
Upvotes: 1
Views: 8483
Reputation: 70108
I tried it myself and got the same result. Ignore my comment, I hadn't seen that you're writing the output to a file as well. This is the problem:
f.write(convertPdf2String(sys.argv[1]))
As convertPdf2String returns a Unicode string, but file.write can only write bytes, the call to f.write tries to automatically convert the Unicode string using ASCII encoding. As the PDF obviously contains non-ASCII characters, that fails. So it should be something like
f.write(convertPdf2String(sys.argv[1]).encode("utf-8"))
# or
f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace"))
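To see the difference between the two encodings without needing a PDF, here is a minimal sketch. The string `sample` is a made-up stand-in for the extracted text (it is not taken from Hindi_Book.pdf), and the variable names are only for illustration. It should run under both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for the Unicode text returned by convertPdf2String.
sample = u"\u0939\u093f\u0902\u0926\u0940 fran\xe7ais"

# Strict ASCII encoding is what the implicit conversion attempts,
# and it raises UnicodeEncodeError on the first non-ASCII character.
try:
    sample.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# UTF-8 can represent every character, so the text survives unchanged.
as_utf8 = sample.encode("utf-8")

# xmlcharrefreplace keeps the result pure ASCII by replacing each
# non-ASCII character with a numeric reference such as &#231;.
as_refs = sample.encode("ascii", "xmlcharrefreplace")
```

Which one to use depends on what will read a.txt: UTF-8 keeps the original characters, while the character references are only useful if the consumer understands XML/HTML entities.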
EDIT:
Here is the working source code; only one line changed.
# Execute with "Hindi_Book.pdf" in the same directory
import sys
import pyPdf

def convertPdf2String(path):
    content = ""
    # load PDF file
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # iterate pages
    for i in range(0, pdf.getNumPages()):
        # extract the text from each page
        content += pdf.getPage(i).extractText() + " \n"
    # collapse whitespace
    content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
    return content

# convert contents of a PDF file and store result to TXT file
f = open('a.txt','w+')
f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace"))
f.close()

# or print contents to the standard out stream
print convertPdf2String("Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")
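As an alternative to calling .encode() at every write, the file object itself can do the encoding. The sketch below uses io.open (available since Python 2.6, and the same function as the built-in open in Python 3). The helpers `write_text` and `read_text`, the `sample` string, and the temporary output path are assumptions made for illustration; they are not part of the original script.

```python
import io
import os
import tempfile

def write_text(path, text):
    # io.open encodes Unicode text to UTF-8 on write, so no
    # explicit .encode() call is needed at the call site.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)

def read_text(path):
    # Decode the UTF-8 bytes back into a Unicode string.
    with io.open(path, "r", encoding="utf-8") as f:
        return f.read()

# Hypothetical stand-in for the text extracted from the PDF.
sample = u"\u0939\u093f\u0902\u0926\u0940 fran\xe7ais"

out_path = os.path.join(tempfile.mkdtemp(), "a.txt")
write_text(out_path, sample)
round_trip = read_text(out_path)
```

In the script above, this would roughly correspond to replacing the open/write/close lines with write_text('a.txt', convertPdf2String(sys.argv[1])), assuming UTF-8 output is acceptable.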
Upvotes: 2