How to convert PDF files encoded in Unicode into text using Python 3 and PyPDF2

Question

I am trying to convert PDFs into text files using Python 3 and the PyPDF2 library. The PDFs are mostly written in Korean, so I assumed the text has to be handled as UTF-8 before it is processed. However, neither opening the PDF with the built-in open function nor opening it with codecs.open yields properly extracted UTF-8 text. Is there a way to extract the text from these PDFs with Python 3, using PyPDF2 or any other relevant Python library? Thanks in advance!

(You can download an example file via http://dart.fss.or.kr/pdf/download/pdf.do?rcp_no=20180402005019&dcm_no=6060273)

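import PyPDF2
import codecs  # only used by the commented-out alternative below

# A PDF is a binary format, so open it in binary mode without a text encoding.
pdf_file = open('6060273.pdf', 'rb')
#pdf_file = codecs.open('6060273.pdf', 'rb', encoding='utf-8')

# PdfFileReader, getNumPages, getPage and extractText are the PyPDF2 1.x names.
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(20)  # zero-based index, i.e. the 21st page
page_content = page.extractText()
# extractText() returns a str (already Unicode in Python 3), so print it
# directly; calling .encode('utf-8') first would print a bytes literal.
print(page_content)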
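A side note on the encoding step, as a minimal sketch rather than a fix for the extraction itself: in Python 3 a str is already Unicode, so encoding it before printing only produces a bytes representation. UTF-8 matters at the point where the text is written to a file. The sample string and the save_text helper below are hypothetical stand-ins for text returned by a PDF extractor; nothing here depends on PyPDF2.

```python
import os
import tempfile


def save_text(text: str, path: str) -> None:
    """Write already-extracted text to a UTF-8 encoded text file."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(text)


# Hypothetical sample standing in for text returned by a PDF extractor.
sample = "사업보고서"

# Encoding a str yields bytes; printing them shows escape sequences,
# not readable Korean.
as_bytes = sample.encode("utf-8")
print(as_bytes)

# Round trip through a temporary file to confirm the text survives intact.
out_path = os.path.join(tempfile.mkdtemp(), "page.txt")
save_text(sample, out_path)
with open(out_path, encoding="utf-8") as f:
    round_trip = f.read()
```

If the extracted string is itself empty or garbled, the cause lies in how the library decodes the fonts embedded in the PDF, and changing how the file is opened or how the result is encoded will not help.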
