Reputation: 678

Correct length of a string of non-English characters in Python3

I am given a file containing a string of Hebrew characters (plus some Arabic ones; I can read neither language):

צוֹר‎

When I load this string from the file in Python 3:

fin = open("filename")
x = next(fin).strip()

The length of x appears to be 5

>>> len(x)
5

Its UTF-8 encoding is

>>> x.encode("utf-8")
b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'

However, in a browser the string clearly displays as 3 Hebrew characters.

How do I get the length properly, and why does this happen?

I know that Python 3 strings are Unicode by default, so I did not expect an issue like this.

Upvotes: 4

Answers (4)

Ahmad Yoosofan

Reputation: 981

Open the file with UTF-8 encoding, either directly or inside a with block:

fin = open('filename','r',encoding='utf-8')

with open('filename','r',encoding='utf-8') as fin:
    for line1 in fin:
        print(len(line1.strip()))

Upvotes: 0

lemonhead

Reputation: 5518

The reason is that the text contains the character \u200e, an invisible left-to-right mark (often used when text in multiple languages is mixed, to demarcate left-to-right runs from right-to-left ones). It also contains a vowel point (the little dot above the second character, which shows how to pronounce it), and that point counts as a code point of its own.

If you replace the LTR mark with the empty string, for instance, the length becomes 4:

>> x = 'צוֹר'
>> x
'צוֹר\u200e' # note the control character escape sequence
>> print(len(x))
5

>> print(len(x.replace('\u200e', '')))
4

If you want to count only alphabetic and whitespace characters, you could use re.sub to strip everything that is neither a word character nor whitespace:

>> import re
>> print(len(re.sub(r'[^\w\s]', '', x)))
3
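
To see exactly which code points make up the five "characters", something like the following sketch may help. It rebuilds the string from the UTF-8 bytes given in the question; the helper name describe is made up for illustration and is not part of any library.

```python
import unicodedata

# The question's string, rebuilt from the UTF-8 bytes shown there.
x = b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'.decode('utf-8')

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text.

    Hypothetical helper for illustration only.
    """
    return [(f'U+{ord(c):04X}', unicodedata.name(c, '<unnamed>')) for c in text]

for code_point, name in describe(x):
    print(code_point, name)
```

This should list three Hebrew letters, one Hebrew point (the vowel dot), and the left-to-right mark, which accounts for the length of 5.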

Upvotes: 6

Michael Butscher

Reputation: 10969

Unicode characters have different categories. In your case:

>>> import unicodedata
>>> s = b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'.decode("utf-8")
>>> list(unicodedata.category(c) for c in s)
['Lo', 'Lo', 'Mn', 'Lo', 'Cf']

Lo: Letter, other (neither uppercase nor lowercase, etc.). These are the "real" characters.
Mn: Mark, nonspacing. This is a kind of accent that combines with the previous character.
Cf: Other, format. Here it is an invisible left-to-right mark that influences the text direction.
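
Building on these categories, one rough way to get the length the asker expects is to skip marks and format characters when counting. This is only a sketch: the name visible_length and the choice of skipped categories are assumptions made for illustration, and it approximates rather than implements full grapheme-cluster segmentation.

```python
import unicodedata

# Categories this sketch treats as "not a visible character" (an assumption):
# Mn = nonspacing mark, Me = enclosing mark, Cf = format character.
SKIPPED_CATEGORIES = {'Mn', 'Me', 'Cf'}

def visible_length(text):
    """Count characters whose Unicode category is not in SKIPPED_CATEGORIES."""
    return sum(1 for c in text if unicodedata.category(c) not in SKIPPED_CATEGORIES)

s = b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'.decode('utf-8')
print(len(s), visible_length(s))
```

For the string in the question, len(s) is 5 while visible_length(s) is 3, because the vowel point (Mn) and the left-to-right mark (Cf) are skipped.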

Upvotes: 4

Dawid Laszuk

Reputation: 1978

Have you tried the io library?

>>> import io
>>> with io.open('text.txt', mode="r", encoding="utf-8") as f:
...     x = f.read()
...
>>> print(len(x))

You can also try codecs:

>>> import codecs
>>> with codecs.open('text.txt', 'r', 'utf-8') as f:
...     x = f.read()
...
>>> print(len(x))

Upvotes: 0
