carl
carl

Reputation: 4426

python special character decoding/encoding

How can I translate the following string

H.P. Dembinski, B. K\'{e}gl, I.C. Mari\c{s}, M. Roth, D. Veberi\v{c}

into

H. P. Dembinski, B. K\xe9gl, I. C. Mari\u015f, M. Roth, D. Veberi\u010d

?

Upvotes: 3

Views: 394

Answers (2)

Functino
Functino

Reputation: 1935

This code should handle the patterns you have in your example. It's now general enough to add the rest of those codes. Just put them into the table.

#!/usr/bin/python3
import re, unicodedata, sys

table = {
        'v': '\u030C',
        'c': '\u0327',
        "'": '\u0301'
        # etc...
        }

def despecial(s):
    return re.sub(r"\\(" + '|'.join(map(re.escape, table)) + r")\{(\w+)}",
            lambda m: m.group(2) + table[m.group(1)],
            s)

if __name__ == '__main__':
    print(unicodedata.normalize('NFC', despecial(' '.join(sys.argv[1:]))))

Example:

>>> despecial(r"H.P. Dembinski, B. K\'{e}gl, I.C. Mari\c{s}, M. Roth, D. Veberi\v{c}")
'H.P. Dembinski, B. Kégl, I.C. Mariş, M. Roth, D. Veberič'

Example (command line):

$ ./path/to/script.py "Hello W\v{o}rld"
Hello Wǒrld

It puts the appropriate Unicode combining character after the argument given. Specifically: U+0301 COMBINING ACUTE ACCENT, U+0327 COMBINING CEDILLA, and U+030C COMBINING CARON. If you want the string composed, you can just normalize it with unicodedata.normalize or something.

>>> import unicodedata
>>> unicodedata.normalize('NFC', despecial(r"H.P. Dembinski, B. K\'{e}gl, I.C. Mari\c{s}, M. Roth, D. Veberi\v{c}"))
'H.P. Dembinski, B. Kégl, I.C. Mariş, M. Roth, D. Veberič'

That said, I'm sure there's a better way to handle this. It looks like what you have is LaTeX code.

Upvotes: 4

Mark Ransom
Mark Ransom

Reputation: 308196

>>> s = "H.P. Dembinski, B. K\\'{e}gl, I.C. Mari\\c{s}, M. Roth, D. Veberi\\v{c}"
>>> s.replace(u"\\'{e}", u"\xe9").replace(u"\\c{s}", u"\u015f").replace(u"\\v{c}", u"\u010d")
u'H.P. Dembinski, B. K\xe9gl, I.C. Mari\u015f, M. Roth, D. Veberi\u010d'

That of course is the brute-force method. As you say you'll have many possible replacements, here's another way that's still brute-force but cleaner:

>>> table = ((u"\\'{e}", u"\xe9"), (u"\\c{s}", u"\u015f"), (u"\\v{c}", u"\u010d"))
>>> new = s
>>> for pattern, ch in table:
        new = new.replace(pattern, ch)
>>> new
u'H.P. Dembinski, B. K\xe9gl, I.C. Mari\u015f, M. Roth, D. Veberi\u010d'

Since there's a common pattern to the replacement string you can also take advantage of regular expressions.

>>> import re
>>> split = re.split(u"(\\\\['a-z]{[a-z]})", s)
>>> table = {u"\\'{e}": u"\xe9", u"\\c{s}": u"\u015f", u"\\v{c}": u"\u010d"}
>>> ''.join(table[piece] if piece in table else piece for piece in split)
u'H.P. Dembinski, B. K\xe9gl, I.C. Mari\u015f, M. Roth, D. Veberi\u010d'

Upvotes: 1

Related Questions