Reputation: 4239
I am trying to find the PDF page that contains a particular string:
"statement of profit or loss"
and I'm trying to accomplish this using the following regex:
re.search('statement of profit or loss', text, re.IGNORECASE)
But even though the page contains the string "statement of profit or loss", the regex returned None. On investigating the document further, I found that the characters 'fi' in "profit", as written in the document, are more congested than usual. When I copied the string from the document and pasted it into my code, it worked fine.
So, if I copy "statement of profit or loss" from the document and paste it into re.search() in my code, it works. But if I type "statement of profit or loss" manually in my code, re.search() returns None. How can I avoid this behavior?
Upvotes: 2
Views: 99
Reputation: 22478
The 'congested' characters copied from your PDF are actually a single character: the 'fi' ligature, U+FB01 (ﬁ).
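A quick way to confirm this is to inspect the code points of the extracted text. The following is only a minimal sketch: the variable names `pasted` and `typed` are made up for illustration, and the pasted string is written with an explicit `\ufb01` escape to stand in for text copied from the PDF.

```python
import re
import unicodedata

# Hypothetical stand-in for text copied from the PDF: "fi" is the single code point U+FB01.
pasted = "statement of pro\ufb01t or loss"
# The same phrase typed on a keyboard: "f" and "i" are two separate characters.
typed = "statement of profit or loss"

# The two strings look alike but differ in length by one character.
print(len(pasted), len(typed))

# List every non-ASCII character with its code point and Unicode name.
for ch in pasted:
    if ord(ch) > 127:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The typed pattern does not match the extracted text, even with re.IGNORECASE.
print(re.search(typed, pasted, re.IGNORECASE))
```

If the loop prints LATIN SMALL LIGATURE FI, the mismatch comes from the ligature and not from the regex itself.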
Either it was entered as such in the source document, or the typesetting engine that was used to create the PDF replaced the combination f + i with ﬁ.
Combining two or more characters into a single glyph is a fairly common operation in "nice typesetting", and it is not limited to fi, fl, ff, and fj, although these are the most used combinations. (That is because in some fonts the long overhang of the f glyph jarringly touches or overlaps the next character.) In fact, a font can have any number of ligatures; some Adobe fonts use a single ligature for Th.
Usually this is not a problem for text extraction, because the PDF can specify that certain glyphs must be decoded as a string of characters: the original characters. So, possibly your PDF does not contain such a definition, or the typesetting engine did not bother because the single character ﬁ is a valid Unicode character in itself (although using it is strongly discouraged).
You can work around this by explicitly cleaning up your text strings before processing them any further:
text = text.replace('\ufb01', 'fi')  # U+FB01 is the single-character fi ligature
Repeat this for the other problematic ligatures that have a Unicode code point: ﬂ (U+FB02), ﬀ (U+FB00), ﬃ (U+FB03), and ﬄ (U+FB04) (I may have missed some).
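The repeated replace calls can be folded into a single translation table. This is only a sketch of one way to do it: the names `LIGATURES`, `expand_ligatures` and `find_phrase` are made up for this example, and the table covers just the five f-ligatures listed above.

```python
import re
import unicodedata

# Translation table for the f-ligatures that have their own Unicode code points.
LIGATURES = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
})


def expand_ligatures(text: str) -> str:
    """Replace each ligature character with its plain-letter spelling."""
    return text.translate(LIGATURES)


def find_phrase(phrase: str, text: str):
    """Case-insensitive search for a literal phrase in ligature-free text."""
    return re.search(re.escape(phrase), expand_ligatures(text), re.IGNORECASE)


# Example: extracted page text that still contains U+FB01.
page_text = "Consolidated Statement of Pro\ufb01t or Loss"
match = find_phrase("statement of profit or loss", page_text)
print(match is not None)

# Alternative: NFKC normalization also expands these ligatures.
print(unicodedata.normalize("NFKC", "pro\ufb01t"))
```

NFKC normalization is shorter to write, but it also rewrites other compatibility characters (superscripts and full-width forms, for example), so the explicit table is the narrower change if only ligatures are a problem.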
Upvotes: 3