Kristada673

Reputation: 3744

Why is this code not able to extract unicode text from PDFs properly?

I want to extract the text contained in a PDF. This is my code to do this:

import textract

doc = textract.process(r"C:\path\to\the\downloaded.pdf", encoding='raw_unicode_escape')
with open('pdf_to_text.txt', 'wb') as f:
    f.write(doc)

This is the output:

\u53cb\u90a6\u4fdd\u96aa\u63a7\u80a1\u6709\u9650\u516c\u53f8

REAL LIFE
REAL IMPACT
A NNUA L REP ORT 2015

STOCK CODE : 1299

VISION & PURPOSE
Our Vision is to be the pre-eminent life
insurance provider in the Asia-Pacific region.
That is our service to our customers and
our shareholders.
Our Purpose is to play a leadership role in
driving economic and social development
across the region. That is our service to
societies and their people.

ABOUT AIA
AIA Group Limited and its subsidiaries (collectively \u201cAIA\u201d
or the \u201cGroup\u201d) comprise the largest independent publicly
listed pan-Asian life insurance group. It has a presence in
18 markets in Asia-Paci\ufb01c \u2013 wholly-owned branches and
subsidiaries in Hong Kong, Thailand, Singapore, Malaysia,
China, Korea, the Philippines, Australia, Indonesia, Taiwan,
... ...
... ...
... ...

As can be seen, it reads some of the "fancy" text (unicode? ascii?) properly, but not all. How do I fix this?

I have tried 5 encoding schemes:

- utf-8 produces bad results
- utf-16 produces the worst results, converting everything to illegible text
- ascii produces not-so-bad results, but does leave behind a few illegible characters
- unicode_escape produces average results, leaving quite a few illegible characters
- raw_unicode_escape also produces good results, but like ascii leaves a few behind
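For reference, the different behavior of these codecs can be reproduced on a small sample; this sketch just encodes the first CJK characters from the escaped output above under each codec to compare what textract would emit:

```python
# The first characters of the escaped output above: 友邦保險
s = "\u53cb\u90a6\u4fdd\u96aa"

# Compare how each codec serializes the same CJK string to bytes.
for codec in ("utf_8", "utf_16", "unicode_escape", "raw_unicode_escape"):
    print(codec, s.encode(codec))
```

Both escape codecs turn everything above U+00FF into literal `\uXXXX` sequences in the byte stream, which is exactly what shows up in the output file.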

This is the link to the PDF which I downloaded to the local drive for the analysis:

https://www.aia.com/content/dam/group/en/docs/annual-report/aia-annual-report-2015-eng.pdf

P.S. Another small, unrelated issue: the extraction sometimes keeps gaps between the letters of a word, like "A NNUA L REP ORT" in the snippet above. How can this be fixed?

EDIT: I have found the possible encoding options on pages 10 and 11 of textract's documentation. But there are almost a hundred of them:

Possible choices: aliases, ascii, base64_codec, big5, big5hkscs,
bz2_codec, charmap, cp037, cp1006, cp1026, cp1140, cp1250, cp1251,
cp1252, cp1253, cp1254, cp1255, cp1256, cp1257, cp1258, cp424,
cp437, cp500, cp720, cp737, cp775, cp850, cp852, cp855, cp856,
cp857, cp858, cp860, cp861, cp862, cp863, cp864, cp865, cp866,
cp869, cp874, cp875, cp932, cp949, cp950, euc_jis_2004, euc_jisx0213,
euc_jp, euc_kr, gb18030, gb2312, gbk, hex_codec, hp_roman8, hz,
idna, iso2022_jp, iso2022_jp_1, iso2022_jp_2, iso2022_jp_2004,
iso2022_jp_3, iso2022_jp_ext, iso2022_kr, iso8859_1, iso8859_10,
iso8859_11, iso8859_13, iso8859_14, iso8859_15, iso8859_16,
iso8859_2, iso8859_3, iso8859_4, iso8859_5, iso8859_6, iso8859_7,
iso8859_8, iso8859_9, johab, koi8_r, koi8_u, latin_1, mac_arabic,
mac_centeuro, mac_croatian, mac_cyrillic, mac_farsi, mac_greek,
mac_iceland, mac_latin2, mac_roman, mac_romanian, mac_turkish,
mbcs, palmos, ptcp154, punycode, quopri_codec, raw_unicode_escape,
rot_13, shift_jis, shift_jis_2004, shift_jisx0213, string_escape, tactis,
tis_620, undefined, unicode_escape, unicode_internal, utf_16, utf_16_be,
utf_16_le, utf_32, utf_32_be, utf_32_le, utf_7, utf_8, utf_8_sig, uu_codec,
zlib_codec

How can I identify which one is used in this particular PDF? And what if even that one leaves behind a few characters? Or is it guaranteed that one of these must be THE encoding scheme that leaves no illegible character behind?
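One pure-stdlib way to narrow the list is to keep only the codecs that decode the raw bytes without raising an error. This is only a sketch and cannot prove a codec is the right one, since permissive codecs like latin_1 accept any byte string; the `viable_codecs` helper and the sample bytes are illustrative, not part of textract:

```python
def viable_codecs(data, candidates):
    """Return the candidate codec names that decode `data` without raising."""
    viable = []
    for name in candidates:
        try:
            data.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # this codec cannot represent these bytes
        viable.append(name)
    return viable

# Hypothetical sample: a CJK string serialized as UTF-8.
sample = "\u53cb\u90a6".encode("utf_8")
print(viable_codecs(sample, ["ascii", "utf_8", "utf_16", "latin_1"]))
```

ascii is correctly ruled out here, but latin_1 still passes, so a decode-without-error check filters candidates rather than identifying the true encoding.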

Upvotes: 1

Views: 1609

Answers (1)

Kristada673

Reputation: 3744

This is the way I solved it. I used the removegarbage function I found here to replace every run of non-alphanumeric characters with a single space.

import re

def removegarbage(text):
    # Replace one or more non-word (non-alphanumeric) chars with a space
    text = re.sub(r'\W+', ' ', text)
    return text.lower()

doc = removegarbage(doc.decode('raw_unicode_escape'))
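A detail worth knowing here: in Python 3, `\W` is Unicode-aware, so CJK characters count as word characters and survive the substitution while punctuation does not (a quick check on a made-up sample string):

```python
import re

# \W+ is Unicode-aware in Python 3: CJK letters are word characters,
# so only punctuation and whitespace runs collapse to a single space.
cleaned = re.sub(r'\W+', ' ', 'AIA Group Limited \u53cb\u90a6\u4fdd\u96aa!')
print(cleaned.lower())
```

This is why the Chinese company name is still intact in the cleaned output below.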

If you open the txt file in a basic text editor (like Notepad), you may still see illegible characters. But if you print the string in the console (or open the file in a more capable editor), you can see those characters are gone:

>>> print(doc)
'aia group limited 友邦保險控股有限公司 real life real impact a nnua l rep ort 
2015 stock code 1299 vision purpose our vision is to be the pre eminent life 
insurance provider in the asia pacific region that is our service to our 
customers and our shareholders our purpose is to play a leadership role in 
driving economic and social development across the region that is our 
service to societies and their people about aia aia group limited and its 
subsidiaries collectively aia or the group comprise the largest independent 
publicly listed pan asian life insurance group it has a presence in 18 
markets in asia pacific wholly owned branches and subsidiaries in hong kong 
thailand singapore malaysia china korea the philippines australia indonesia 
taiwan ... ... ...

Yeah, the punctuation and capitalization are gone too, but that is OK, as neither matters for what I intend to do with this extracted text.
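If the goal is a file that basic editors render correctly too, writing the decoded string back out as UTF-8 text (instead of the raw escape-encoded bytes) should avoid the problem entirely. A sketch, where `doc` stands in for the decoded, cleaned string from above:

```python
# Hypothetical stand-in for the decoded, cleaned string from above.
doc = "aia group limited \u53cb\u90a6\u4fdd\u96aa\u63a7\u80a1\u6709\u9650\u516c\u53f8"

# Write text mode with an explicit UTF-8 encoding instead of raw bytes.
with open('pdf_to_text.txt', 'w', encoding='utf-8') as f:
    f.write(doc)

# Reading it back confirms the CJK characters round-trip intact.
with open('pdf_to_text.txt', encoding='utf-8') as f:
    print(f.read() == doc)  # True
```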

Upvotes: 1
