user975135

Convert ASCII chars to Unicode FULLWIDTH latin letters in Python?

Is there an easy way to convert between ASCII characters and their full-width (East Asian wide) Unicode equivalents? For example:

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~

to

０１２３４５６７８９ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ！゛＃＄％＆（）＊＋、ー。／：；〈＝＝〉？＠［\\］＾＿‘｛｜｝～

Upvotes: 15

Views: 7185

Answers (7)

Heizi

Reputation: 31

I came here looking for a way to convert any FULLWIDTH, HALFWIDTH or IDEOGRAPHIC Unicode character to its 'normal' equivalent, if it has one.

I ended up writing my own solution because I wanted one that doesn't rely on a manually typed translation string, which tends to produce missing or incorrect mappings, as John Machin's answer demonstrates. Here's the code in case it's useful to anyone:

import unicodedata 
unicode_range = (0, 0x10ffff)

# create a dict where the values are unicode characters
# and the keys are the character names, if they have one.
chars = {}
for uc_point in range(unicode_range[0], unicode_range[1]+1):
    char = chr(uc_point)
    try:
        name = unicodedata.name(char)
        chars[name] = char
    except ValueError: #chars with no name such as control characters
        pass

def normal(name):
    # 'IDEOGRAPHIC COMMA' -> 'COMMA'
    # 'HALFWIDTH IDEOGRAPHIC COMMA' -> 'COMMA'
    # 'LATIN SMALL LETTER A' -> None
    # so we want to look for these at the start of character names:
    starts = ['HALFWIDTH IDEOGRAPHIC','IDEOGRAPHIC','FULLWIDTH','HALFWIDTH']
    l = [name[len(start)+1:] for start in starts if name.startswith(start)]
    if l:
        return l[0]
    else:
        return None

# who doesn't love a bit of dict comprehension for the finish:
mapping = {chars[name]: chars[normal(name)] for name in chars if normal(name) in chars}

This gets us a neat mapping that can then be used with str.maketrans() and str.translate() as demonstrated in Nils von Barth's answer:

>>> ''.join(mapping.keys())
'\u3000、。!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~⦅⦆。「」、・ヲァィゥェォャュョッーアイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワンᅠᄀᄁᆪᄂᆬᆭᄃᄄᄅᆰᆱᆲᆳᆴᆵᄚᄆᄇᄈᄡᄉᄊᄋᄌᄍᄎᄏᄐᄑ하ᅢᅣᅤᅥᅦᅧᅨᅩᅪᅫᅬᅭᅮᅯᅰᅱᅲᅳᅴᅵ¢£¬ ̄¦¥₩←↑→↓■○𝍲𝍶'

and

>>> ''.join(mapping.values())
' ,.!"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~⦅⦆.「」,・ヲァィゥェォャュョッーアイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワンㅤㄱㄲㄳㄴㄵㄶㄷㄸㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅃㅄㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ¢£¬¯¦¥₩←↑→↓■○𝍷𝍸'

This solution is also future-proof, since it relies on the stdlib module unicodedata, whose copy of the Unicode database is updated with new Python releases.
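As a minimal sketch of that last step, here is how a character-to-character dict can be handed to str.maketrans() and str.translate() in Python 3. The three-entry `sample_mapping` and the helper name `to_normal` are hypothetical stand-ins for illustration; in practice you would pass the full mapping built above.

```python
# Hypothetical three-entry subset of the full mapping, for illustration only.
sample_mapping = {
    '\uff21': 'A',   # FULLWIDTH LATIN CAPITAL LETTER A
    '\u3000': ' ',   # IDEOGRAPHIC SPACE
    '\u3001': ',',   # IDEOGRAPHIC COMMA
}

# str.maketrans accepts a single dict whose keys are one-character strings
# (or code points) and returns a table usable with str.translate.
table = str.maketrans(sample_mapping)


def to_normal(text):
    """Replace every character found in the table; leave the rest untouched."""
    return text.translate(table)
```

Characters that are not keys in the dict pass through unchanged, so the same call is safe on mixed text.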

Upvotes: 3

Nils von Barth

Reputation: 3429

Yes; in Python 3, the cleanest approach is to use str.translate and str.maketrans:

HALFWIDTH_TO_FULLWIDTH = str.maketrans(
    '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&()*+,-./:;<=>?@[]^_`{|}~',
    '０１２３４５６７８９ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ！゛＃＄％＆（）＊＋、ー。／：；〈＝〉？＠［］＾＿‘｛｜｝～')

def halfwidth_to_fullwidth(s):
    return s.translate(HALFWIDTH_TO_FULLWIDTH)

In Python 2, str.maketrans is instead string.maketrans and doesn’t work with Unicode characters, so you need to build a dictionary, as Ignacio Vazquez-Abrams's answer shows.

Upvotes: 3

John Machin

Reputation: 82934

The range of fullwidth ASCII replacements starts at U+FF01, not U+FF00. U+FF00 is (strangely) not defined. To get a fullwidth SPACE, you need to use U+3000 IDEOGRAPHIC SPACE. Don't rely on typing what appears to be what you want followed by visual inspection of characters to check your mapping -- unicodedata.name is your friend. This code:

# coding: utf-8
from unicodedata import name as ucname

# OP
normal = u"""0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~"""
wide = u"""０１２３４５６７８９ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ！゛＃＄％＆（）＊＋、ー。／：；〈＝〉？＠［\\］＾＿‘｛｜｝～"""
# above after editing (had = twice)
widemapOP = dict((ord(x[0]), x[1]) for x in zip(normal, wide))

# Ignacio V
normal = u' 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'
wide = u'　０１２３４５６７８９ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ！゛＃＄％＆（）＊＋、ー。／：；〈＝〉？＠［\\］＾＿‘｛｜｝～'
widemapIV = dict((ord(x[0]), x[1]) for x in zip(normal, wide))

# JM
widemapJM = dict((i, i + 0xFF00 - 0x20) for i in xrange(0x21, 0x7F))
widemapJM[0x20] = 0x3000 # IDEOGRAPHIC SPACE

maps = {'OP': widemapOP, 'IV': widemapIV, 'JM': widemapJM}.items()

for i in xrange(0x20, 0x7F):
    a = unichr(i)
    na = ucname(a, '?')
    for tag, widemap in maps:
        w = a.translate(widemap)
        nw = ucname(w, '?')
        if nw != "FULLWIDTH " + na:
            print "%s: %04X %s => %04X %s" % (tag, i, na, ord(w), nw)

when run shows what you have really got: some missing mappings and some idiosyncratic mappings:

JM: 0020 SPACE => 3000 IDEOGRAPHIC SPACE
IV: 0020 SPACE => 3000 IDEOGRAPHIC SPACE
OP: 0020 SPACE => 0020 SPACE
IV: 0022 QUOTATION MARK => 309B KATAKANA-HIRAGANA VOICED SOUND MARK
OP: 0022 QUOTATION MARK => 309B KATAKANA-HIRAGANA VOICED SOUND MARK
IV: 0027 APOSTROPHE => 0027 APOSTROPHE
OP: 0027 APOSTROPHE => 0027 APOSTROPHE
IV: 002C COMMA => 3001 IDEOGRAPHIC COMMA
OP: 002C COMMA => 3001 IDEOGRAPHIC COMMA
IV: 002D HYPHEN-MINUS => 30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK
OP: 002D HYPHEN-MINUS => 30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK
IV: 002E FULL STOP => 3002 IDEOGRAPHIC FULL STOP
OP: 002E FULL STOP => 3002 IDEOGRAPHIC FULL STOP
IV: 003C LESS-THAN SIGN => 3008 LEFT ANGLE BRACKET
OP: 003C LESS-THAN SIGN => 3008 LEFT ANGLE BRACKET
IV: 003E GREATER-THAN SIGN => 3009 RIGHT ANGLE BRACKET
OP: 003E GREATER-THAN SIGN => 3009 RIGHT ANGLE BRACKET
IV: 005C REVERSE SOLIDUS => 005C REVERSE SOLIDUS
OP: 005C REVERSE SOLIDUS => 005C REVERSE SOLIDUS
IV: 0060 GRAVE ACCENT => 2018 LEFT SINGLE QUOTATION MARK
OP: 0060 GRAVE ACCENT => 2018 LEFT SINGLE QUOTATION MARK
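For readers on Python 3, the same check can be sketched roughly as follows. This is a port under assumptions, not the original script: the function name `mismatches` is hypothetical, and the table is the offset-based one (0x21..0x7E shifted into the fullwidth block, space mapped to U+3000).

```python
import unicodedata

# Offset-based table: U+0021..U+007E -> U+FF01..U+FF5E, plus SPACE -> U+3000.
table = {i: i + 0xFF00 - 0x20 for i in range(0x21, 0x7F)}
table[0x20] = 0x3000  # IDEOGRAPHIC SPACE


def mismatches(widemap):
    """Return (code point, ASCII name, mapped name) for each printable ASCII
    character whose mapped character is not named 'FULLWIDTH <ASCII name>'.
    Assumes every value in widemap is a single code point or character."""
    bad = []
    for i in range(0x20, 0x7F):
        src = chr(i)
        dst = src.translate(widemap)
        src_name = unicodedata.name(src, '?')
        dst_name = unicodedata.name(dst, '?')
        if dst_name != 'FULLWIDTH ' + src_name:
            bad.append((i, src_name, dst_name))
    return bad
```

With this table the only entry reported is the space, which maps to IDEOGRAPHIC SPACE because there is no character named FULLWIDTH SPACE.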

Upvotes: 10

werewindle

Reputation: 3029

Those "wide" characters are named FULLWIDTH LATIN LETTER: http://www.unicodemap.org/range/87/Halfwidth%20and%20Fullwidth%20Forms/

They occupy the range 0xFF00-0xFFEF. You can build a look-up table or just add 0xFEE0 to the ASCII code of each printable character (0x21-0x7E).
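The offset approach might look like this in Python 3. It is a minimal sketch: the names `FULLWIDTH_OFFSET`, `to_fullwidth` and `to_halfwidth` are made up for illustration, and mapping the space to U+3000 IDEOGRAPHIC SPACE is an added assumption, since U+FF00 itself is not an assigned character.

```python
# Distance between printable ASCII (U+0021..U+007E) and the
# fullwidth forms (U+FF01..U+FF5E).
FULLWIDTH_OFFSET = 0xFEE0


def to_fullwidth(text):
    """Shift printable ASCII into the fullwidth block; space becomes U+3000."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x20:
            out.append('\u3000')                      # IDEOGRAPHIC SPACE
        elif 0x21 <= code <= 0x7E:
            out.append(chr(code + FULLWIDTH_OFFSET))
        else:
            out.append(ch)                            # leave everything else alone
    return ''.join(out)


def to_halfwidth(text):
    """Inverse of to_fullwidth for the same set of characters."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            out.append(' ')
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - FULLWIDTH_OFFSET))
        else:
            out.append(ch)
    return ''.join(out)
```

Because both directions use the same fixed offset, a round trip through `to_fullwidth` and `to_halfwidth` returns the original ASCII text.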

Upvotes: 10

Ignacio Vazquez-Abrams

Reputation: 798754

Yes.

>>> normal = u' 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'
>>> wide = u'　０１２３４５６７８９ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ！゛＃＄％＆（）＊＋、ー。／：；〈＝〉？＠［\\］＾＿‘｛｜｝～'
>>> widemap = dict((ord(x[0]), x[1]) for x in zip(normal, wide))
>>> print u'Hello, world!'.translate(widemap)
Ｈｅｌｌｏ、　ｗｏｒｌｄ！

Upvotes: 3

tchrist

Reputation: 80415

This goes one way:

#!/usr/bin/env perl
# uniwide
use utf8;
use strict;
use warnings;
use open qw(:std :utf8);

while (<>) {    
    s/\s/\x{A0}\x{A0}/g if tr
      <!"#$%&´()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~¢£>
      <!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~¢£>;;    
} continue {
      print;   
} 

close(STDOUT) || die "can't close stdout: $!";

And this goes the other:

#!/usr/bin/env perl
# uninarrow
use utf8;
use strict;
use warnings;
use open qw(:std :utf8);

while (<>) {     
    s/  / /g if tr
      <!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~¢£>
      <!"#$%&´()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~¢£>    
} continue {
      print;    
} 

close(STDOUT) || die "can't close stdout: $!";

Upvotes: 0

sorin

Reputation: 170508

The UTF-8 encoding of ASCII characters is exactly the same as ASCII. For UTF-16, add a zero byte after each character (LE) or before it (BE).

Or in python mystr.encode("utf-8")

Upvotes: -3
