PaF

Reputation: 3477

Convert unicode representation of number to ascii string

I've been looking for a simple way to convert a number from a Unicode string to an ASCII string in Python. For example, the input:

input = u'\u0663\u0669\u0668\u066b\u0664\u0667'

Should yield '398.47'.

I started with:

NUMERALS_TRANSLATION_TABLE = {0x660:ord("0"), 0x661:ord("1"), 0x662:ord("2"), 0x663:ord("3"), 0x664:ord("4"), 0x665:ord("5"), 0x666:ord("6"), 0x667:ord("7"), 0x668:ord("8"), 0x669:ord("9"), 0x66b:ord(".")}
input.translate(NUMERALS_TRANSLATION_TABLE)

This works, but I want to support all number-related characters in Unicode, not just the Arabic ones. I can translate the digits by iterating over the unicode string and calling unicodedata.digit(input[i]) on each character. I don't like this solution, because it doesn't handle '\u066b' or '\u2013'. I could handle those by using translate as a fallback, but I'm not sure whether there are other such characters that I'm not aware of, so I'm looking for a better, more elegant solution.
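For reference, a minimal sketch of the per-character approach described above, written for Python 3. The names to_ascii_number and FALLBACK are hypothetical, and mapping '\u2013' to '-' is an assumption about how that character should be treated:

```python
import unicodedata

# Hypothetical fallback table for characters that have no digit value.
# Treating U+2013 EN DASH as a minus sign is an assumption.
FALLBACK = {'\u066b': '.', '\u2013': '-'}

def to_ascii_number(text):
    """Replace each character by its digit value, else by the fallback entry."""
    out = []
    for ch in text:
        value = unicodedata.digit(ch, None)  # None when ch is not a digit
        if value is not None:
            out.append(str(value))
        else:
            # Characters missing from FALLBACK pass through unchanged.
            out.append(FALLBACK.get(ch, ch))
    return ''.join(out)

print(to_ascii_number('\u0663\u0669\u0668\u066b\u0664\u0667'))
```

The weakness is the one noted above: FALLBACK only covers the separators I already know about.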

Any suggestions would be greatly appreciated.

Upvotes: 3

Views: 908

Answers (2)

Martijn Pieters

Reputation: 1122142

Using unicodedata.digit() to look up the digit values for 'numeric' codepoints is the correct method:

>>> import unicodedata
>>> unicodedata.digit(u'\u0663')
3

This uses the Unicode standard information to look up numeric values for a given codepoint.
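As a small illustration, assuming Python 3: unicodedata.digit() raises ValueError for a codepoint without a digit value unless a default is passed as the second argument. The helper name digit_value is hypothetical:

```python
import unicodedata

def digit_value(ch):
    """Return the Unicode digit value of ch, or None if it has none."""
    # The second argument is returned instead of raising ValueError.
    return unicodedata.digit(ch, None)

print(digit_value('\u0663'))  # ARABIC-INDIC DIGIT THREE
print(digit_value('\u00b2'))  # SUPERSCRIPT TWO also carries a digit value
print(digit_value('\u066b'))  # ARABIC DECIMAL SEPARATOR is not a digit
```

This is why the decimal separator needs separate handling.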

You could build a translation table by using str.isdigit() to test for digits; this is true for all codepoints for which the standard defines a digit value. For decimal points, you could look for DECIMAL SEPARATOR in the codepoint name; the standard doesn't track these separately by any other metric:

NUMERALS_TRANSLATION_TABLE = {
    i: unicode(unicodedata.digit(unichr(i)))
    for i in range(2 ** 16) if unichr(i).isdigit()}
NUMERALS_TRANSLATION_TABLE.update(
    (i, u'.') for i in range(2 ** 16)
    if 'DECIMAL SEPARATOR' in unicodedata.name(unichr(i), ''))

That produces a table of 447 entries, including 2 decimal points: U+066B ARABIC DECIMAL SEPARATOR and U+2396 DECIMAL SEPARATOR KEY SYMBOL. The latter is really just a made-up symbol for the decimal separator key on a numeric keypad, for manufacturers that don't want to commit to printing a , or . decimal separator on that key.
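The code above is Python 2 (unicode, unichr), and range(2 ** 16) only covers the Basic Multilingual Plane. A rough Python 3 sketch of the same idea might look like the following; the function name build_numerals_table is hypothetical, and scanning up to sys.maxunicode is my own extension to pick up digits outside the BMP:

```python
import sys
import unicodedata

def build_numerals_table():
    """Map every digit codepoint to its ASCII digit and separators to '.'."""
    table = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        value = unicodedata.digit(ch, None)  # None for non-digits
        if value is not None:
            table[cp] = str(value)
        elif 'DECIMAL SEPARATOR' in unicodedata.name(ch, ''):
            table[cp] = '.'
    return table

NUMERALS_TRANSLATION_TABLE = build_numerals_table()
text = '\u0663\u0669\u0668\u066b\u0664\u0667'
print(text.translate(NUMERALS_TRANSLATION_TABLE))
```

The number of entries depends on the Unicode version bundled with the interpreter, so it will not necessarily match the count quoted above. Building the table scans over a million codepoints, so it is worth doing once at import time rather than per call.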

Demo:

>>> import unicodedata
>>> NUMERALS_TRANSLATION_TABLE = {
...     i: unicode(unicodedata.digit(unichr(i)))
...     for i in range(2 ** 16) if unichr(i).isdigit()}
>>> NUMERALS_TRANSLATION_TABLE.update(
...     (i, u'.') for i in range(2 ** 16)
...     if 'DECIMAL SEPARATOR' in unicodedata.name(unichr(i), ''))
>>> input = u'\u0663\u0669\u0668\u066b\u0664\u0667'
>>> input.translate(NUMERALS_TRANSLATION_TABLE)
u'398.47'

Upvotes: 3

idwaker

Reputation: 416

>>> from unidecode import unidecode
>>> unidecode(u'\u0663\u0669\u0668\u066b\u0664\u0667')
'398.47'

Upvotes: 0
