Reputation: 48616
If I have a Python Unicode string that contains combining characters, len reports a value that does not correspond to the number of characters "seen".
For example, if I have a string with combining overlines and underlines such as u'A\u0332\u0305BC', len(u'A\u0332\u0305BC') reports 5, but the displayed string is only 3 characters long.
How do I get the "visible" length of a Unicode string containing combining glyphs in Python, that is, the number of distinct positions occupied by the string the user sees?
Upvotes: 13
Views: 1982
Reputation: 308530
The unicodedata module has a function combining that can be used to determine if a single character is a combining character. If it returns 0 you can count the character as non-combining.
import unicodedata
len(u''.join(ch for ch in u'A\u0332\u0305BC' if unicodedata.combining(ch) == 0))
or, slightly simpler:
sum(1 for ch in u'A\u0332\u0305BC' if unicodedata.combining(ch) == 0)
Edit: as pointed out in the comments, there are code points other than combining marks that modify a character without being a character themselves that should not be in the count. Here's a more robust version of the above:
modifier_categories = set(['Mc', 'Mn'])
sum(1 for ch in u'A\u0332\u0305BC' if unicodedata.category(ch) not in modifier_categories)
We can use another Python trick to make that even simpler, taking advantage of True==1 and False==0:
sum(unicodedata.category(ch) not in modifier_categories for ch in u'A\u0332\u0305BC')
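A minimal sketch that wraps the idea above in a reusable function. The name visible_length is hypothetical, and adding 'Me' (enclosing marks) to the set is my own assumption rather than part of the snippets above:

```python
import unicodedata

# Mark categories treated as modifying the preceding character:
# Mn = nonspacing mark, Mc = spacing combining mark, Me = enclosing mark.
# Including 'Me' is an assumption added for this sketch.
MARK_CATEGORIES = frozenset(["Mn", "Mc", "Me"])


def visible_length(text):
    """Count the code points in text that are not combining marks."""
    # True counts as 1 and False as 0, so sum() yields the count directly.
    return sum(unicodedata.category(ch) not in MARK_CATEGORIES for ch in text)


print(visible_length(u'A\u0332\u0305BC'))
```

This only filters by general category, so other zero-width code points that are not marks are still counted.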
Upvotes: 4
Reputation: 9096
Combining characters are not the only zero-width characters:
>>> sum(1 for ch in u'\u200c' if unicodedata.combining(ch) == 0)
1
("\u200c" is the zero-width non-joiner; it's a non-printing character.)
In this case the regex module does not work either:
>>> len(regex.findall(r'\X', u'\u200c'))
1
I found wcwidth, which handles the above case correctly:
>>> from wcwidth import wcswidth
>>> wcswidth(u'A\u0332\u0305BC')
3
>>> wcswidth(u'\u200c')
0
But it still doesn't seem to work with user 596219's example:
>>> wcswidth('각')
4
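For comparison, here is a rough stdlib-only sketch that also gives 0 for the zero-width non-joiner. The name approx_visible_length, the decision to skip format characters (category Cf), and the NFC normalization step are all my assumptions; none of this is what wcwidth does internally. NFC composes decomposed sequences, such as conjoining Hangul jamo, into a single code point where a precomposed form exists. Note that it counts code points, not terminal columns, so it will differ from wcswidth for wide characters:

```python
import unicodedata

# Categories assumed to occupy no position of their own (not exhaustive):
# the three mark categories plus Cf (format characters such as U+200C).
ZERO_WIDTH_CATEGORIES = frozenset(["Mn", "Mc", "Me", "Cf"])


def approx_visible_length(text):
    """Approximate the number of visible positions in text."""
    # Compose base + mark sequences first where a precomposed form exists.
    composed = unicodedata.normalize("NFC", text)
    return sum(
        unicodedata.category(ch) not in ZERO_WIDTH_CATEGORIES
        for ch in composed
    )


print(approx_visible_length(u'A\u0332\u0305BC'))
print(approx_visible_length(u'\u200c'))
# Decomposed Hangul jamo (U+1100 U+1161 U+11A8) compose to one syllable.
print(approx_visible_length(u'\u1100\u1161\u11a8'))
```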
Upvotes: 3
Reputation: 104102
If you have a regex flavor that supports matching graphemes, you can use \X. While the default Python re module does not support \X, Matthew Barnett's regex module does:
>>> len(regex.findall(r'\X', u'A\u0332\u0305BC'))
3
On Python 2, you need to use the u prefix in the pattern:
>>> regex.findall(u'\\X', u'A\u0332\u0305BC')
[u'A\u0332\u0305', u'B', u'C']
>>> len(regex.findall(u'\\X', u'A\u0332\u0305BC'))
3
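If installing regex is not an option, a very rough stdlib approximation is to attach each combining mark to the character before it. This is only a sketch: the helper name rough_clusters is hypothetical, and it does not implement the full grapheme cluster rules that \X follows, so results will differ for things like Hangul jamo sequences or zero-width joiners:

```python
import unicodedata


def rough_clusters(text):
    """Group each character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # Categories Mn, Mc and Me all start with 'M'.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


print(len(rough_clusters(u'A\u0332\u0305BC')))
```

A mark with nothing before it starts its own cluster here, which is a simplification chosen for this sketch.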
Upvotes: 5