Reputation: 82934
The task is to format numbers, currency amounts and dates as unicode
strings in a locale-aware manner.
A first, naive attempt with numbers gave hope:
Python 2.7 (r27:82525, Jul 4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_ALL, '')
'English_Australia.1252'
>>> locale.format("%d", 12345678, grouping=True)
'12,345,678'
>>> locale.format(u"%d", 12345678, grouping=True)
u'12,345,678'
>>>
Now try French:
>>> locale.setlocale(locale.LC_ALL, 'French_France')
'French_France.1252'
>>> locale.format("%d", 12345678, grouping=True)
'12\xa0345\xa0678'
>>> locale.format(u"%d", 12345678, grouping=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python27\lib\locale.py", line 190, in format
    return _format(percent, value, grouping, monetary, *additional)
  File "C:\python27\lib\locale.py", line 211, in _format
    formatted, seps = _group(formatted, monetary=monetary)
  File "C:\python27\lib\locale.py", line 160, in _group
    left_spaces + thousands_sep.join(groups) + right_spaces,
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
What is happening here?
>>> locale.localeconv() # output edited for brevity
{'thousands_sep': '\xa0', 'mon_thousands_sep': '\xa0', 'currency_symbol': '\x80'}
Wah! Looks a little legacyish. A work-around suggests itself:
>>> locale.format("%d", 12345678, grouping=True).decode(locale.getpreferredencoding())
u'12\xa0345\xa0678'
>>>
UPDATE 1: locale.getpreferredencoding() is NOT the way to go; use locale.getlocale()[1] instead:
Python 2.7 (r27:82525, Jul 4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.getpreferredencoding(), locale.getlocale()
('cp1252', (None, None))
>>> locale.setlocale(locale.LC_ALL, '')
'English_Australia.1252'
>>> locale.getpreferredencoding(), locale.getlocale()
('cp1252', ('English_Australia', '1252'))
>>> locale.setlocale(locale.LC_ALL, 'russian_russia')
'Russian_Russia.1251'
>>> locale.getpreferredencoding(), locale.getlocale()
('cp1252', ('Russian_Russia', '1251')) #### Whoops! ####
>>>
UPDATE 2: There are very similar problems with the strftime() family and with str.format():
>>> locale.setlocale(locale.LC_ALL, 'french_france')
'French_France.1252'
>>> format(12345678, 'n')
'12\xa0345\xa0678'
>>> format(12345678, u'n') # type triggers cast to unicode somehow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2: ordinal not in range(128)
>>> import datetime;datetime.date(1999,12,31).strftime(u'%B') # type is ignored
'd\xe9cembre'
>>>
In all cases, the workaround is to pass only str objects to these functions, get a str result back, and decode it using the encoding obtained from locale.getlocale()[1].
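A minimal sketch of that workaround follows. The helper name `decode_locale_text` is hypothetical, and the fallback to ASCII when no locale encoding is reported is an assumption rather than anything the locale module prescribes.

```python
import locale


def decode_locale_text(raw, encoding=None):
    """Decode a byte string returned by a locale-aware call into Unicode text.

    `decode_locale_text` is a hypothetical helper name, not part of `locale`.
    """
    if not isinstance(raw, bytes):
        # Already Unicode text, so there is nothing to decode.
        return raw
    if encoding is None:
        # Use the encoding of the locale currently in effect rather than
        # locale.getpreferredencoding(); fall back to ASCII (an assumption)
        # when no locale has been set and getlocale() reports None.
        encoding = locale.getlocale()[1] or "ascii"
    return raw.decode(encoding)
```

On Python 2 it would wrap a str-only call, for example decode_locale_text(locale.format("%d", 12345678, grouping=True)) or decode_locale_text(some_date.strftime('%B')).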
Other problems:
(1) It's a considerable nuisance when testing/exploring that the Windows locale names are not only different from POSIX ("fr_FR") but also verbose, and not fully documented. For example, evidently the grouping in India is not "every 3 digits" ... I can't find the locale to use to explore this; attempts like "Hindi" and "Hindi_India" don't work.
(2) Some of the localeconv() data is just plain wrong. E.g. for Korean the currency symbol is given as '\\' i.e. a single backslash. I'm aware that some 7-bit legacy charsets are not ASCII-compatible and that chr(92) was sometimes used for the local currency symbol, so I expected '\\'.decode('949') to produce a won symbol, not just u'\\'.
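The Korean observation in (2) can be reproduced without setting any locale. This is only a sketch using Python's built-in cp949 codec; the variable names are illustrative.

```python
import unicodedata

# Python's cp949 codec maps byte 0x5C to U+005C, exactly as ASCII does.
decoded = b"\\".decode("cp949")
print(repr(decoded), unicodedata.name(decoded))

# The won sign is a separate code point, U+20A9, so decoding the
# backslash byte can never produce it.
won = u"\u20a9"
print(repr(won), unicodedata.name(won))
```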
I'm aware of modules such as babel, but I don't particularly want to impose a big external dependency like that. Can I get correctness and convenience at the same time? Is there something about the locale module that I've missed?
Upvotes: 4
Views: 4262
Reputation: 127467
The thing about the locale module you seem to have missed is that it exposes your operating system vendor's (really: C library vendor's) notion of locales. So on Windows, you will have to use Windows locale names; consult your OS vendor's documentation to find out which names are supported. Googling for "windows locale name" quickly brings up this list.
That locale.format doesn't really support Unicode is a 2.x limitation; try Python 3.1.
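As a rough sketch of what that looks like on Python 3: the helper name `format_grouped` is hypothetical, and it assumes the caller has already selected a locale with locale.setlocale(). It uses locale.format_string(), since locale.format() was deprecated and later removed from Python 3.

```python
import locale


def format_grouped(value):
    """Format an integer with the current locale's digit grouping (Python 3).

    `format_grouped` is a hypothetical helper; it assumes the caller has
    already selected a locale with locale.setlocale().
    """
    # format_string() returns str, which is Unicode text in Python 3, so a
    # non-ASCII separator such as U+00A0 needs no separate decode step.
    return locale.format_string("%d", value, grouping=True)
```

The result depends entirely on the active locale, so no particular separator should be relied on.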
Edit: as for the Won sign, I think the story is this: Microsoft has allocated the Won sign to the same code position as the backslash in MS-DOS (likewise for the Yen sign in Japanese versions). As a consequence, the file separator character was the Won sign, and rendered as such. As they moved to Windows, and later to Unicode, they had to keep supporting this, but they also had to preserve the property that the file separator is the backslash (in particular in the Unicode API). They resolved this conflict so that
Upvotes: 3