Barbaros26

Reputation: 159

Convert hexadecimal character (ligature) to utf-8 character

I have some text content that was converted from a PDF file. There are some unwanted characters in the text and I want to convert them to UTF-8 characters.

For instance, 'Artificial Immune System' comes out as 'Artiﬁcial Immune System'. The 'ﬁ' is converted as a single character. I used gdex to look up the ASCII value of the character, but I don't know how to replace it with the real letters throughout the content.

Upvotes: 10

Views: 3671

Answers (1)

Martin Geisler

Reputation: 73758

I guess what you're seeing are ligatures: professional fonts have glyphs that combine several individual characters into a single (better looking) glyph. So instead of writing "f" and "i" as two glyphs, the font has a single "ﬁ" glyph. Compare "fi" (two letters) with "ﬁ" (single glyph, U+FB01).

In Python, you can use the unicodedata module to manipulate Unicode text. You can also exploit the conversion to NFKD normal form to split ligatures:

>>> import unicodedata
>>> unicodedata.name(u'\uFB01')
'LATIN SMALL LIGATURE FI'
>>> unicodedata.normalize("NFKD", u'Arti\uFB01cial Immune System')
u'Artificial Immune System'
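
NFKD touches more than ligatures, which is worth seeing before applying it to a whole document. The sketch below assumes Python 3 and uses a made-up sample string (not from the question) that contains both the ﬁ ligature and a precomposed "é". It compares NFKD with NFKC, which applies the same compatibility mapping but recomposes accented letters afterwards:

```python
import unicodedata

# Sample string (an assumption for illustration): U+FB01 is the fi ligature,
# U+00E9 is a precomposed e with acute accent.
text = "Arti\uFB01cial caf\u00E9"

nfkd = unicodedata.normalize("NFKD", text)
nfkc = unicodedata.normalize("NFKC", text)

# NFKD splits the ligature AND splits the accent off the e.
# NFKC splits the ligature but keeps the accented e as one code point.
print(len(text), len(nfkd), len(nfkc))  # prints: 14 16 15
```

If the text contains accented letters that should stay as single characters, NFKC may be the gentler choice. Both forms still rewrite other compatibility characters, not just ligatures.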

So normalizing your strings with NFKD should help you along. If you find that this splits too much (NFKD also decomposes accented letters and other compatibility characters), then my best suggestion is to make a small mapping table of the ligatures you want to split and replace the ligatures manually:

>>> ligatures = {0xFB00: u'ff', 0xFB01: u'fi'}
>>> u'Arti\uFB01cial Immune System'.translate(ligatures)
u'Artificial Immune System'
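
The mapping-table approach can be wrapped in a small helper. This is only a sketch, assuming Python 3; the name `expand_ligatures` and the choice of which ligatures to include are mine, not something from a library. The table covers the common Latin ligatures from the Alphabetic Presentation Forms block and leaves every other character untouched:

```python
# Hypothetical table: extend or trim it to match the ligatures in your text.
LIGATURES = str.maketrans({
    "\uFB00": "ff",
    "\uFB01": "fi",
    "\uFB02": "fl",
    "\uFB03": "ffi",
    "\uFB04": "ffl",
    "\uFB06": "st",
})


def expand_ligatures(text):
    """Replace the ligatures listed in LIGATURES with plain letters."""
    return text.translate(LIGATURES)


print(expand_ligatures("Arti\uFB01cial Immune System"))
# Accented letters are not in the table, so they pass through unchanged.
print(expand_ligatures("caf\u00E9") == "caf\u00E9")
```

Because only the listed code points are replaced, this avoids the side effects of full normalization, at the cost of maintaining the table yourself.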

Refer to the Wikipedia article to get a list of ligatures in Unicode.
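
If you would rather not type the table by hand, it can probably be derived from Python's own Unicode database. The sketch below assumes Python 3 and only scans the Latin ligatures at U+FB00 to U+FB06; the function name `build_ligature_table` is hypothetical. It keeps a code point when its Unicode name contains "LIGATURE" and uses the NFKD form of that single character as the replacement:

```python
import unicodedata


def build_ligature_table(first=0xFB00, last=0xFB06):
    """Map each ligature code point in [first, last] to its NFKD expansion."""
    table = {}
    for codepoint in range(first, last + 1):
        char = chr(codepoint)
        # unicodedata.name() returns the default "" for unnamed code points.
        if "LIGATURE" in unicodedata.name(char, ""):
            table[codepoint] = unicodedata.normalize("NFKD", char)
    return table


table = build_ligature_table()
print(table[0xFB01])
print("Arti\uFB01cial Immune System".translate(table))
```

The resulting dictionary uses integer keys, so it can be passed straight to `translate` just like the hand-written `ligatures` table above. Widening the range will pick up ligatures from other scripts too, so check the result before using it on real text.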

Upvotes: 18
