Sarthak Kumar

Reputation: 304

python PyPDF2 - Special characters are printed while trying to print text from a PDF file?

I am trying to print the text from a PDF file using the PyPDF2 module, but some special characters are printed.
I already tried this solution, but it does not seem to work.
code

import PyPDF2

obj = open('/home/sarthak/Documents/UNIT-4.pdf','rb')

pdfReader = PyPDF2.PdfFileReader(obj)

print(pdfReader.numPages)   #printing No. of pages

pageObj = pdfReader.getPage(0)

print(pageObj.extractText().encode('ascii','ignore'))    #also used 'utf-8' but doesn't work either

obj.close()

output

17
b'\n\n\n\n!#$\n\n\n\n\n\n\n\n\n\n\n  \n\n"%$\n\n\n"#\n\n\n $\n\n\n\'())(*+, -$&\n\n\n\n\n $&-\n $\n'

Upvotes: 0

Views: 1054

Answers (1)

Jinu Joseph

Reputation: 702

To remove the \n characters, you can pass the result through textacy.

import textacy
data = textacy.preprocess.remove_punct(section, marks='\n')
print(data)

where section is the extracted text.

To install textacy, run: pip install textacy
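
If you would rather not install an extra package, the same newline cleanup can be sketched with the standard library alone. This is only a rough sketch: strip_newlines is a hypothetical helper name, and the sample string is made up to stand in for the extracted text. It collapses whitespace only, so it would not change any of the other symbols in the output.

```python
import re


def strip_newlines(text: str) -> str:
    # Replace every run of whitespace (newlines, tabs, spaces) with one space,
    # then trim the leading and trailing space left at the ends.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample standing in for the string returned by the PDF extraction
section = "\n\n\nUNIT 4\n\n\nsome text\n"
print(strip_newlines(section))
```

Here section plays the same role as in the textacy snippet: it is the string you got from the page, before any .encode() call.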

Upvotes: 1
