Rowling

Reputation: 213

UnicodeEncodeError when extracting text from a PDF in Python

I am trying to extract the content of a PDF file and store it in a text file. My code works fine for page 1 of the PDF (pdfreader.getPage(0)), but when I run it for page 2, I get an error:

UnicodeEncodeError: 'gbk' codec can't encode character '\u2122' in position 1831: illegal multibyte sequence

I am new to Python and not sure what this means. My code is:

import PyPDF2
pdffileobj=open('meetingminutes.pdf','rb')
pdfreader=PyPDF2.PdfFileReader(pdffileobj)
pageobj=pdfreader.getPage(1)

content=pageobj.extractText()
file=open('pdftotext.txt','w')
file.write(str(content))
file.close()

Upvotes: 0

Views: 4013

Answers (1)

Alan

Reputation: 3042

TL;DR: file=open('pdftotext.txt','w', encoding="utf-16")

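The text PyPDF2 extracts from page 2 contains a trademark character that the output file's encoding cannot represent. In Python 3, open() in text mode without an encoding argument does not always use UTF-8; it uses the platform's locale encoding, and on your system that is GBK, which is why the error names the 'gbk' codec. Python's gbk codec has no mapping for that character, so file.write() raises UnicodeEncodeError. Page 1 works only because it happens to contain nothing outside GBK.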

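'gbk' is a Chinese character encoding, and it shows up here because it is the default encoding on your machine, not because the PDF contains Chinese text. To quote: GBK is an extension of the GB2312 character set for simplified Chinese characters ... GBK has been extended by Microsoft in Code page 936/1386.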

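'\u2122' is Python's escape notation for the Unicode code point U+2122, the trademark sign. A Python 3 string is a sequence of Unicode code points and is not "in" UTF-8 or UTF-16; an encoding only comes into play when the string is converted to bytes, for example when writing to a file. In UTF-8 the trademark sign is the three bytes e2 84 a2. If you must keep writing GBK, you could instead replace the character in the string before writing, for example with "TM".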
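
To make the difference between a character and its encoded bytes concrete, here is a minimal sketch. It does not use PyPDF2 at all; the helper name replace_trademark is hypothetical and only illustrates the "replace it with TM" workaround mentioned above.

```python
import unicodedata

tm = "\u2122"  # a single Unicode code point, U+2122

# The string itself has no encoding; it is just one character.
print(len(tm), unicodedata.name(tm))

# Bytes only appear once an encoding is chosen.
utf8_bytes = tm.encode("utf-8")
print(utf8_bytes.hex())  # the UTF-8 byte sequence in hex

# Python's gbk codec cannot represent this character,
# which is the same failure as in the question.
try:
    tm.encode("gbk")
    gbk_failed = False
except UnicodeEncodeError as exc:
    gbk_failed = True
    print("gbk cannot encode it:", exc.reason)


def replace_trademark(text: str) -> str:
    """Hypothetical helper: swap the trademark sign for plain 'TM'."""
    return text.replace("\u2122", "TM")


print(replace_trademark("Acme\u2122 minutes").encode("gbk"))
```

After the replacement the text is plain ASCII, so it can be written with any encoding, at the cost of losing the original character.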

Adding a coding header such as "# coding=utf-16" to the script will not help here. Per PEP 263 (Python Source Code Encodings), that header only declares the encoding of the source file itself; it has no effect on the encoding used for files the script opens.

The easiest solution, though, is to specify a Unicode encoding when opening the output file (encoding="utf-8" would work just as well and is the more common choice):

file=open('pdftotext.txt','w', encoding="utf-16")
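
As a slightly fuller sketch of the same idea, the snippet below writes a string containing the trademark sign with an explicit encoding and reads it back. The sample string stands in for the result of pageobj.extractText(), the function name write_text is hypothetical, and utf-8 is my assumption for the encoding; utf-16 works the same way.

```python
import os
import tempfile


def write_text(path: str, content: str, encoding: str = "utf-8") -> None:
    """Write content to path using an explicit encoding.

    The with block closes the file even if write() raises.
    """
    with open(path, "w", encoding=encoding) as out:
        out.write(content)


# Stand-in for the text returned by pageobj.extractText().
sample = "Acme\u2122 meeting minutes"

out_path = os.path.join(tempfile.mkdtemp(), "pdftotext.txt")
write_text(out_path, sample)

# Read back with the same encoding to confirm nothing was lost.
with open(out_path, encoding="utf-8") as f:
    round_trip = f.read()

# If the file really has to be GBK, errors="replace" avoids the
# exception by substituting "?" for unencodable characters.
lossy = sample.encode("gbk", errors="replace")
print(round_trip == sample, lossy)
```

Whoever reads the file later has to open it with the same encoding, so it is worth picking one and using it consistently.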

Upvotes: 1
