Reputation: 304
I am trying to print the text of a PDF file using the PyPDF2 module, but only a few special characters are printed instead of the text.
I already tried this solution, but it does not seem to work.
code
import PyPDF2
obj = open('/home/sarthak/Documents/UNIT-4.pdf','rb')
pdfReader = PyPDF2.PdfFileReader(obj)
print(pdfReader.numPages) #printing No. of pages
pageObj = pdfReader.getPage(0)
print(pageObj.extractText().encode('ascii','ignore')) #also used 'utf-8' but doesn't work either
obj.close()
output
17
b'\n\n\n\n!#$\n\n\n\n\n\n\n\n\n\n\n \n\n"%$\n\n\n"#\n\n\n $\n\n\n\'())(*+, -$&\n\n\n\n\n $&-\n $\n'
Upvotes: 0
Views: 1054
Reputation: 702
To remove the \n characters, you can pass the result to textacy.
import textacy
data = textacy.preprocess.remove_punct(section, marks='\n')
print(data)
where section is the extracted data.
To install textacy, run: pip install textacy
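If installing textacy is not an option, the same newline cleanup can probably be done with the standard library alone. This is only a rough sketch: collapse_newlines is a name made up for illustration, and the sample string is invented, not real PDF output.

```python
import re


def collapse_newlines(text):
    """Replace every run of whitespace, newlines included, with one space."""
    return re.sub(r"\s+", " ", text).strip()


# Invented sample standing in for the extracted page text (assumption).
section = "first line\n\n\nsecond line\n"
print(collapse_newlines(section))
```

Note that this only tidies the whitespace; it does not bring back characters that the extraction step failed to return.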
Upvotes: 1