Reputation: 2036
I am trying to count how many times the same word occurs in an Urdu document that is saved as UTF-8.
For example, I have a document containing three identical words separated by spaces:
خُداوند خُداوند خُداوند
I tried to count the words by reading the file using the following code:
import codecs

file_obj = codecs.open(path, encoding="utf-8")
lst = repr(file_obj.readline()).split(" ")
word = lst[0]
count = 0
for w in lst:
    if word == w:
        count += 1
print count
But the value of count I get is 1, while I expect 3.
How does one compare Unicode strings?
Upvotes: 0
Views: 325
Reputation: 1121186
Remove the repr() from your code. Use repr() only to create debug output; you are turning a unicode value into a string that can be pasted back into the interpreter.
This means your line from the file is now stored as:
>>> repr(u'خُداوند خُداوند خُداوند\n').split(" ")
["u'\\u062e\\u064f\\u062f\\u0627\\u0648\\u0646\\u062f", '\\u062e\\u064f\\u062f\\u0627\\u0648\\u0646\\u062f', "\\u062e\\u064f\\u062f\\u0627\\u0648\\u0646\\u062f\\n'"]
Note the double backslashes (escaped unicode escapes), and that the first string starts with u' and the last string ends with \\n'. These values are obviously never equal.
Remove the repr(), and use .split() without arguments to remove the trailing whitespace too:
lst = file_obj.readline().split()
and your code will work:
>>> res = u'خُداوند خُداوند خُداوند\n'.split()
>>> res[0] == res[1] == res[2]
True
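Putting the pieces together, a minimal sketch of the corrected counting loop might look like the following. The temporary file and its contents are stand-ins for the real document (assumptions made only so the snippet is self-contained), and io.open is used here instead of codecs.open so that the same code runs on Python 2 and 3:

```python
import io
import os
import tempfile

# Stand-in for the real Urdu document: the word from the question,
# written with escapes, repeated three times on one line.
word_in_file = u"\u062e\u064f\u062f\u0627\u0648\u0646\u062f"
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u" ".join([word_in_file] * 3) + u"\n")

# Read the line as unicode and split on any whitespace (no repr()).
with io.open(path, encoding="utf-8") as file_obj:
    lst = file_obj.readline().split()

word = lst[0]
count = 0
for w in lst:
    if word == w:
        count += 1
print(count)

os.remove(path)  # clean up the stand-in file
```

With the sample line above, count comes out as 3.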
You may need to normalize the input first; some characters can be expressed either as one unicode codepoint or as two combining codepoints. Normalizing moves all such characters to a composed or decomposed state. See Normalizing Unicode.
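As a rough illustration of why normalization can matter, here is a small sketch using the standard-library unicodedata module. The two spellings of the letter alef with madda are an illustrative pair chosen for this example, not something taken from the question's file, and NFC is just one of the available normalization forms:

```python
import unicodedata

# The same letter written two ways (illustrative example):
composed = u"\u0622"          # ARABIC LETTER ALEF WITH MADDA ABOVE
decomposed = u"\u0627\u0653"  # ALEF followed by combining MADDAH ABOVE

# The raw code point sequences differ, so a plain comparison fails.
print(composed == decomposed)

# After normalizing both sides to the same form (NFC here), they match.
a = unicodedata.normalize("NFC", composed)
b = unicodedata.normalize("NFC", decomposed)
print(a == b)
```

The first comparison prints False and the second prints True. Normalizing every word before comparing or counting would avoid this kind of mismatch.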
Upvotes: 3
Reputation: 7257
Comparing unicode strings in Python:
a = u'Artur'
print(a)
b = u'\u0041rtur'
print(b)
if a == b:
    print('the same')
result:
Artur
Artur
the same
Upvotes: 0
Reputation: 33215
Try removing the repr?
lst = file_obj.readline().split()
The point is that you should at least print variables like lst and w to see what they are.
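A quick sketch of that kind of inspection, assuming a sample line built in memory rather than read from the real file. Printing the repr() of the list (for debugging only) makes stray characters visible, such as the newline that split(" ") leaves on the last word:

```python
# Sample line standing in for file_obj.readline() (an assumption).
w = u"\u062e\u064f\u062f\u0627\u0648\u0646\u062f"
line = u" ".join([w] * 3) + u"\n"

by_space = line.split(" ")  # splits on single spaces only
by_any = line.split()       # splits on any whitespace, drops the newline

# repr() is used here purely as debug output.
print(repr(by_space))
print(repr(by_any))

# The last item of by_space still carries the trailing newline.
print(by_space[-1] == by_space[0])
print(by_any[-1] == by_any[0])
```

The last two lines print False and then True, which shows why the argument-less split() is the safer choice here.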
Upvotes: 1