Kristada673

Reputation: 3744

Why is this code not able to extract unicode text from PDFs properly?

I want to extract the text contained in a PDF. This is my code to do this:

import textract

doc = textract.process(r"C:\path\to\the\downloaded.pdf", encoding='raw_unicode_escape')
with open('pdf_to_text.txt', 'wb') as f:
    f.write(doc)

This is the output:

\u53cb\u90a6\u4fdd\u96aa\u63a7\u80a1\u6709\u9650\u516c\u53f8

REAL LIFE
REAL IMPACT
A NNUA L REP ORT 2015

STOCK CODE : 1299

VISION & PURPOSE
Our Vision is to be the pre-eminent life
insurance provider in the Asia-Pacific region.
That is our service to our customers and
our shareholders.
Our Purpose is to play a leadership role in
driving economic and social development
across the region. That is our service to
societies and their people.

ABOUT AIA
AIA Group Limited and its subsidiaries (collectively \u201cAIA\u201d
or the \u201cGroup\u201d) comprise the largest independent publicly
listed pan-Asian life insurance group. It has a presence in
18 markets in Asia-Paci\ufb01c \u2013 wholly-owned branches and
subsidiaries in Hong Kong, Thailand, Singapore, Malaysia,
China, Korea, the Philippines, Australia, Indonesia, Taiwan,
... ...
... ...
... ...

As can be seen, it reads some of the "fancy" text (unicode? ascii?) properly, but not all. How do I fix this?

I have tried 5 encoding schemes:

- utf-8 produces bad results
- utf-16 produces the worst results, converting everything to illegible text
- ascii produces not-so-bad results, but does leave behind a few illegible characters
- unicode_escape produces average results, leaving quite a few illegible characters
- raw_unicode_escape also produces good results, but like ascii leaves a few behind
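For reference, the different behavior of these codecs can be reproduced on a small sample; this sketch just encodes the first CJK characters from the escaped output above under each codec to compare what textract would emit:

```python
# The first characters of the escaped output above: 友邦保險
s = "\u53cb\u90a6\u4fdd\u96aa"

# Compare how each codec serializes the same CJK string to bytes.
for codec in ("utf_8", "utf_16", "unicode_escape", "raw_unicode_escape"):
    print(codec, s.encode(codec))
```

Both escape codecs turn everything above U+00FF into literal `\uXXXX` sequences in the byte stream, which is exactly what shows up in the output file.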

This is the link to the PDF which I downloaded to the local drive for the analysis:

https://www.aia.com/content/dam/group/en/docs/annual-report/aia-annual-report-2015-eng.pdf

P.S. Another small, unrelated issue: the extraction sometimes keeps gaps between the letters of a word, like "A NNUA L REP ORT" in the snippet above. How can this be fixed?

EDIT: I have found the possible encoding options on pages 10 and 11 of textract's documentation. But there are almost a hundred of them:

Possible choices: aliases, ascii, base64_codec, big5, big5hkscs,
bz2_codec, charmap, cp037, cp1006, cp1026, cp1140, cp1250, cp1251,
cp1252, cp1253, cp1254, cp1255, cp1256, cp1257, cp1258, cp424,
cp437, cp500, cp720, cp737, cp775, cp850, cp852, cp855, cp856,
cp857, cp858, cp860, cp861, cp862, cp863, cp864, cp865, cp866,
cp869, cp874, cp875, cp932, cp949, cp950, euc_jis_2004, euc_jisx0213,
euc_jp, euc_kr, gb18030, gb2312, gbk, hex_codec, hp_roman8, hz,
idna, iso2022_jp, iso2022_jp_1, iso2022_jp_2, iso2022_jp_2004,
iso2022_jp_3, iso2022_jp_ext, iso2022_kr, iso8859_1, iso8859_10,
iso8859_11, iso8859_13, iso8859_14, iso8859_15, iso8859_16,
iso8859_2, iso8859_3, iso8859_4, iso8859_5, iso8859_6, iso8859_7,
iso8859_8, iso8859_9, johab, koi8_r, koi8_u, latin_1, mac_arabic,
mac_centeuro, mac_croatian, mac_cyrillic, mac_farsi, mac_greek,
mac_iceland, mac_latin2, mac_roman, mac_romanian, mac_turkish,
mbcs, palmos, ptcp154, punycode, quopri_codec, raw_unicode_escape,
rot_13, shift_jis, shift_jis_2004, shift_jisx0213, string_escape, tactis,
tis_620, undefined, unicode_escape, unicode_internal, utf_16, utf_16_be,
utf_16_le, utf_32, utf_32_be, utf_32_le, utf_7, utf_8, utf_8_sig, uu_codec,
zlib_codec

How can I identify which one is used in this particular PDF? And what if even that one leaves behind a few characters? Or is it guaranteed that one of these must be THE encoding scheme that leaves no illegible character behind?
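One pure-stdlib way to narrow the list is to keep only the codecs that decode the raw bytes without raising an error. This is only a sketch and cannot prove a codec is the right one, since permissive codecs like latin_1 accept any byte string; the `viable_codecs` helper and the sample bytes are illustrative, not part of textract:

```python
def viable_codecs(data, candidates):
    """Return the candidate codec names that decode `data` without raising."""
    viable = []
    for name in candidates:
        try:
            data.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # this codec cannot represent these bytes
        viable.append(name)
    return viable

# Hypothetical sample: a CJK string serialized as UTF-8.
sample = "\u53cb\u90a6".encode("utf_8")
print(viable_codecs(sample, ["ascii", "utf_8", "utf_16", "latin_1"]))
```

ascii is correctly ruled out here, but latin_1 still passes, so a decode-without-error check filters candidates rather than identifying the true encoding.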

Upvotes: 1

Views: 1609

Answers (1)

Kristada673

Reputation: 3744

This is the way I solved it. I used the removegarbage function I found here to replace every run of non-alphanumeric characters with a single space.

import re

def removegarbage(text):
    # Replace one or more non-word (non-alphanumeric) chars with a space
    text = re.sub(r'\W+', ' ', text)
    return text.lower()

doc = removegarbage(doc.decode('raw_unicode_escape'))
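A detail worth knowing here: in Python 3, `\W` is Unicode-aware, so CJK characters count as word characters and survive the substitution while punctuation does not (a quick check on a made-up sample string):

```python
import re

# \W+ is Unicode-aware in Python 3: CJK letters are word characters,
# so only punctuation and whitespace runs collapse to a single space.
cleaned = re.sub(r'\W+', ' ', 'AIA Group Limited \u53cb\u90a6\u4fdd\u96aa!')
print(cleaned.lower())
```

This is why the Chinese company name is still intact in the cleaned output below.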

If you open the txt file in a basic text editor (like Notepad), you may still see illegible characters. But if you print the string in the console (or open the file in a more capable editor), you can see those characters are gone:

>>> print(doc)
'aia group limited 友邦保險控股有限公司 real life real impact a nnua l rep ort 
2015 stock code 1299 vision purpose our vision is to be the pre eminent life 
insurance provider in the asia pacific region that is our service to our 
customers and our shareholders our purpose is to play a leadership role in 
driving economic and social development across the region that is our 
service to societies and their people about aia aia group limited and its 
subsidiaries collectively aia or the group comprise the largest independent 
publicly listed pan asian life insurance group it has a presence in 18 
markets in asia pacific wholly owned branches and subsidiaries in hong kong 
thailand singapore malaysia china korea the philippines australia indonesia 
taiwan ... ... ...

Yeah, the punctuation and capitalization are gone too, but that is OK, as neither matters for what I intend to do with this extracted text.
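If the goal is a file that basic editors render correctly too, writing the decoded string back out as UTF-8 text (instead of the raw escape-encoded bytes) should avoid the problem entirely. A sketch, where `doc` stands in for the decoded, cleaned string from above:

```python
# Hypothetical stand-in for the decoded, cleaned string from above.
doc = "aia group limited \u53cb\u90a6\u4fdd\u96aa\u63a7\u80a1\u6709\u9650\u516c\u53f8"

# Write text mode with an explicit UTF-8 encoding instead of raw bytes.
with open('pdf_to_text.txt', 'w', encoding='utf-8') as f:
    f.write(doc)

# Reading it back confirms the CJK characters round-trip intact.
with open('pdf_to_text.txt', encoding='utf-8') as f:
    print(f.read() == doc)  # True
```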

Upvotes: 1
