Reputation: 199
I am trying to convert PDFs into text files using Python 3 and the PyPDF2 library. The PDFs are mainly written in Korean, so I assume the text has to be handled as UTF-8 when it is extracted. However, neither opening the PDF files with the built-in "open" function nor with "codecs.open" yields properly encoded UTF-8 text. Do you have any ideas for extracting text from such PDF files using Python 3 and PyPDF2 or any other relevant Python library? Thanks in advance!
(You can download an example file via http://dart.fss.or.kr/pdf/download/pdf.do?rcp_no=20180402005019&dcm_no=6060273)
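import PyPDF2

# A PDF is a binary file, so it is opened in 'rb' mode.
# codecs.open(..., encoding='utf-8') does not apply to binary data, so it is not used here.
with open('6060273.pdf', 'rb') as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.getPage(20)  # zero-based index, i.e. the 21st page
    page_content = page.extractText()

# In Python 3 a str is already Unicode; printing .encode('utf-8') would show a bytes literal instead.
print(page_content)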
Upvotes: 1
Views: 5173
Reputation: 178
It seems to me that your problem is related to the font resources installed on your machine rather than to the file encoding. The basic package that comes with PyPDF does not include support for the whole range of UTF-8 characters in advance, because bundling every option would increase the size of the library. However, you can install the necessary fonts in the directory.
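Since the question also asks about other libraries, here is a minimal sketch of one alternative, assuming the pdfminer.six package is installed. It has not been checked against the example file, so whether it recovers the Korean text from that particular PDF is an open question. The helper names extract_pdf_text and save_text_as_utf8 are hypothetical, chosen only for illustration. The point it shows is that extracted text is already a Python 3 str, so UTF-8 only comes into play when the text is written to disk.

```python
from pathlib import Path


def extract_pdf_text(pdf_path, page_numbers=None):
    """Return the text of a PDF as a str, using pdfminer.six (assumed installed)."""
    # Imported lazily so the rest of this snippet works without pdfminer.six.
    from pdfminer.high_level import extract_text

    # page_numbers is an optional collection of zero-based page indices;
    # None means every page is extracted.
    return extract_text(pdf_path, page_numbers=page_numbers)


def save_text_as_utf8(text, txt_path):
    """Write an already-decoded str to disk, encoding it as UTF-8 on the way out."""
    Path(txt_path).write_text(text, encoding="utf-8")
```

With the file from the question, a call such as save_text_as_utf8(extract_pdf_text('6060273.pdf', page_numbers=[20]), 'page21.txt') would write the same page that getPage(20) selects; the output file name here is just an example.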
Upvotes: 1