Reputation: 3744
I want to extract the text contained in a PDF. This is my code to do this:
import textract

# textract returns bytes, so the output file is opened in binary mode
doc = textract.process(r"C:\path\to\the\downloaded.pdf", encoding='raw_unicode_escape')
with open('pdf_to_text.txt', 'wb') as f:
    f.write(doc)
This is the output:
\u53cb\u90a6\u4fdd\u96aa\u63a7\u80a1\u6709\u9650\u516c\u53f8
REAL LIFE
REAL IMPACT
A NNUA L REP ORT 2015
STOCK CODE : 1299
VISION & PURPOSE
Our Vision is to be the pre-eminent life
insurance provider in the Asia-Pacific region.
That is our service to our customers and
our shareholders.
Our Purpose is to play a leadership role in
driving economic and social development
across the region. That is our service to
societies and their people.
ABOUT AIA
AIA Group Limited and its subsidiaries (collectively \u201cAIA\u201d
or the \u201cGroup\u201d) comprise the largest independent publicly
listed pan-Asian life insurance group. It has a presence in
18 markets in Asia-Paci\ufb01c \u2013 wholly-owned branches and
subsidiaries in Hong Kong, Thailand, Singapore, Malaysia,
China, Korea, the Philippines, Australia, Indonesia, Taiwan,
... ...
... ...
... ...
As can be seen, it extracts some of the non-ASCII text properly, but not all of it. How do I fix this?
I have tried 5 encoding schemes: utf-8 produces bad results; utf-16 produces the worst results, converting everything to illegible text; ascii produces not-so-bad results but does leave behind a few characters; unicode_escape produces average results, leaving quite a few illegible characters; and raw_unicode_escape also produces good results but, like ascii, leaves a few behind.
This is the link to the PDF which I downloaded to the local drive for the analysis:
https://www.aia.com/content/dam/group/en/docs/annual-report/aia-annual-report-2015-eng.pdf
P.S. Another small unrelated issue is it keeps gaps between letters of a word at times, like A NNUA L REP ORT
in the text snippet above. How can this be fixed?
EDIT: I have found the possible encoding scheme options on pages 10 and 11 of textract's documentation. But there are almost a hundred of them:
Possible choices: aliases, ascii, base64_codec, big5, big5hkscs,
bz2_codec, charmap, cp037, cp1006, cp1026, cp1140, cp1250, cp1251,
cp1252, cp1253, cp1254, cp1255, cp1256, cp1257, cp1258, cp424,
cp437, cp500, cp720, cp737, cp775, cp850, cp852, cp855, cp856,
cp857, cp858, cp860, cp861, cp862, cp863, cp864, cp865, cp866,
cp869, cp874, cp875, cp932, cp949, cp950, euc_jis_2004, euc_jisx0213,
euc_jp, euc_kr, gb18030, gb2312, gbk, hex_codec, hp_roman8, hz,
idna, iso2022_jp, iso2022_jp_1, iso2022_jp_2, iso2022_jp_2004,
iso2022_jp_3, iso2022_jp_ext, iso2022_kr, iso8859_1, iso8859_10,
iso8859_11, iso8859_13, iso8859_14, iso8859_15, iso8859_16,
iso8859_2, iso8859_3, iso8859_4, iso8859_5, iso8859_6, iso8859_7,
iso8859_8, iso8859_9, johab, koi8_r, koi8_u, latin_1, mac_arabic,
mac_centeuro, mac_croatian, mac_cyrillic, mac_farsi, mac_greek,
mac_iceland, mac_latin2, mac_roman, mac_romanian, mac_turkish,
mbcs, palmos, ptcp154, punycode, quopri_codec, raw_unicode_escape,
rot_13, shift_jis, shift_jis_2004, shift_jisx0213, string_escape, tactis,
tis_620, undefined, unicode_escape, unicode_internal, utf_16, utf_16_be,
utf_16_le, utf_32, utf_32_be, utf_32_le, utf_7, utf_8, utf_8_sig, uu_codec,
zlib_codec
How can I identify which one was used in this particular PDF? And what if even that one leaves a few characters behind? Or is it necessarily true that one of these must be THE encoding scheme that leaves no illegible character at all?
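(A side note on the dump above: the \u53cb\u90a6... runs are literal backslash-escape text produced by the raw_unicode_escape codec, not undecoded bytes, so a second decode turns them back into characters. A minimal illustration, using the first four code points from the dump:)

```python
# The \uXXXX runs in the output are escape sequences written out as ASCII
# text; decoding them with 'unicode_escape' restores the real characters.
raw = b'\\u53cb\\u90a6\\u4fdd\\u96aa'  # first four code points from the dump
text = raw.decode('unicode_escape')
print(text)  # -> 友邦保險
```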
Upvotes: 1
Views: 1609
This is how I solved it. I used the removegarbage function I found here to replace all non-alphanumeric characters with spaces.
import re

def removegarbage(text):
    # Replace one or more non-word (non-alphanumeric) chars with a space
    text = re.sub(r'\W+', ' ', text)
    return text.lower()

doc = removegarbage(doc.decode('raw_unicode_escape'))
If you open the txt file in a basic text editor (like Notepad), you will still see those illegible characters. But if you print it in the console (or possibly even in an advanced text editor?), you will see that those characters are gone:
>>> print(doc)
'aia group limited 友邦保險控股有限公司 real life real impact a nnua l rep ort
2015 stock code 1299 vision purpose our vision is to be the pre eminent life
insurance provider in the asia pacific region that is our service to our
customers and our shareholders our purpose is to play a leadership role in
driving economic and social development across the region that is our
service to societies and their people about aia aia group limited and its
subsidiaries collectively aia or the group comprise the largest independent
publicly listed pan asian life insurance group it has a presence in 18
markets in asia pacific wholly owned branches and subsidiaries in hong kong
thailand singapore malaysia china korea the philippines australia indonesia
taiwan ... ... ...
Yeah, the punctuation and capitalization are gone too, but that's OK, since neither matters for what I intend to do with this extracted text.
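For what it's worth, the CJK characters survive removegarbage because Python 3's `\W` is Unicode-aware: word characters in any script are kept, while quotes and other punctuation are dropped. A small self-contained check (the sample string is my own, not taken from the PDF):

```python
import re

def removegarbage(text):
    # \W+ matches runs of non-word chars; in Python 3 this is Unicode-aware,
    # so CJK word characters are kept while curly quotes etc. become spaces
    return re.sub(r'\W+', ' ', text).lower()

print(removegarbage('AIA Group Limited \u53cb\u90a6\u4fdd\u96aa \u201cAIA\u201d'))
# -> 'aia group limited 友邦保險 aia ' (note the trailing space left by the final quote)
```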
Upvotes: 1