Yo Hsiao

Reputation: 678

Correct length of a string of non-English characters in Python3

I am given a string of Hebrew characters (and possibly some Arabic ones; I can read neither script) in a file:

צוֹר‎

When I load this string from the file in Python 3:

fin = open("filename")
x = next(fin).strip()

The length of x appears to be 5

>>> len(x)
5

Its UTF-8 encoding is:

>>> x.encode("utf-8")
b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'

However, when the string is rendered in a browser, it clearly shows only 3 Hebrew characters.

How can I get the length properly, and why does this happen?

I am aware that Python 3 strings are Unicode by default, so I did not expect such an issue.

Upvotes: 4

Views: 2034

Answers (4)

Ahmad Yoosofan

Reputation: 981

Open the file with utf-8 encoding.

fin = open('filename','r',encoding='utf-8')

or

with open('filename','r',encoding='utf-8') as fin:
    for line1 in fin:
        print(len(line1.strip()))

Upvotes: 0

lemonhead

Reputation: 5518

The reason is that the text contains the control character \u200e, an invisible left-to-right mark (often used when text mixes languages, to demarcate left-to-right runs from right-to-left ones). It also includes a vowel "character" (the little dot above the second letter, which shows how to pronounce it).

If you replace the left-to-right mark with the empty string, for instance, you get a length of 4:

>>> x = 'צוֹר\u200e'
>>> x
'צוֹר\u200e' # note the control character escape sequence
>>> print(len(x))
5

>>> print(len(x.replace('\u200e', '')))
4

If you want to count only alphabetic and whitespace characters, you could use re.sub to strip everything that is neither a word character nor whitespace:

>>> import re
>>> print(len(re.sub(r'[^\w\s]', '', x)))
3
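
To check which code points the string actually holds, a minimal sketch using the standard unicodedata module may help. The string is spelled with escapes here (an assumption about the file's exact contents, based on the UTF-8 bytes shown in the question) so that the invisible characters are visible in the source:

```python
import unicodedata

# The string from the question, spelled with escapes so the invisible
# code points are visible in the source.
x = "\u05e6\u05d5\u05b9\u05e8\u200e"

# Collect the official Unicode name of each code point.
names = [unicodedata.name(ch) for ch in x]

for ch, name in zip(x, names):
    print(f"U+{ord(ch):04X}  {name}")
```

The third entry should come out as HEBREW POINT HOLAM (the vowel dot) and the last as LEFT-TO-RIGHT MARK, which accounts for the two "extra" characters.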

Upvotes: 6

Michael Butscher

Reputation: 10969

Unicode characters have different categories. In your case:

>>> import unicodedata
>>> s = b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'.decode("utf-8")
>>> list(unicodedata.category(c) for c in s)
['Lo', 'Lo', 'Mn', 'Lo', 'Cf']
  • Lo: Letter, other (not uppercase, lowercase or such). These are "real" characters
  • Mn: Mark, nonspacing. This is a kind of accent that combines with the preceding character
  • Cf: Other, format. An invisible formatting character; here it is the left-to-right mark (U+200E), which influences the direction in which adjacent text is displayed
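
Building on these categories, one possible way to get the "visible" length is to skip nonspacing marks and format characters when counting. This is only a sketch: visible_length is a hypothetical helper name, and skipping just Mn and Cf is a simplifying assumption that works for this string but is not full grapheme-cluster segmentation.

```python
import unicodedata

s = b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'.decode("utf-8")


def visible_length(text):
    """Count code points that are neither nonspacing marks (Mn)
    nor format characters (Cf). Hypothetical helper, for illustration."""
    return sum(1 for c in text if unicodedata.category(c) not in ("Mn", "Cf"))


print(len(s))             # number of code points
print(visible_length(s))  # code points left after skipping Mn and Cf
```

For the string in the question this prints 5 and then 3. Text with other kinds of combining characters (for example category Mc or Me) would need a broader filter.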

Upvotes: 4

Dawid Laszuk

Reputation: 1978

Have you tried the io library?

>>> import io
>>> with io.open('text.txt', mode="r", encoding="utf-8") as f:
...     x = f.read()
...
>>> print(len(x))

You can also try codecs:

>>> import codecs
>>> with codecs.open('text.txt', 'r', 'utf-8') as f:
...     x = f.read()
...
>>> print(len(x))

Upvotes: 0
