Reputation: 286

Python: How to compare a unicode with unicode within variable

SOLVED

I solved the problem, thanks all for your time.

First of all, these are the requirements:

The comparison MUST be between variables (that is, compare 2 variables that contain Unicode text).
The version of Python MUST be 2.x. I know version 3 has solved this problem, but unfortunately it won't work for me.

So hello, I have a bot coded with python, and I would like to make it compare 2 non-English letters (unicode).

The problem I have is, the letters MUST be within variables, so I can't use:

u'letter'

Both letters I would like to compare MUST be within variables.

I have tried:

letter1 == letter2

it's showing this error: E:\bots\KiDo\KiDo.py:23: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal import sys

and it always returns False even when the 2 letters are the same. So I guess it means I'm comparing 2 unicode letters.

And tried:

letter = unicode(letter)

but it shows this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd9 in position 0: ordinal not in range(128)

I have searched all over Google, but all I could find is using u' ', but this won't work with the variables.

Thank you.

Comparison Code:

word1 = parameters.split()[0]
word2 = parameters.split()[1]
word3 = parameters.split()[2]
word4 = parameters.split()[3]
word5 = parameters.split()[4]
if word1[0] == letter:
    if word2[0] == letter:
        if word3[0] == letter:
            if word4[0] == letter:
                if word5[0] == letter:
                    reply(type, source,u'True')

Upvotes: 2

Answers (4)

Mark Tolonen

Reputation: 177620

I think you don't understand Unicode vs. an encoding.

Refer to this article: http://www.joelonsoftware.com/articles/Unicode.html

Note the following... UTF-8 is an encoding of Unicode, but is not Unicode. The # coding: utf-8 declaration at the top of the source below declares the encoding of the source file as saved on disk. a = u'ç' declares a Unicode variable. b = 'ç' is a byte string in the source encoding (utf-8).

Note that repr displays a different, source-like representation of each string so you can tell them apart, and type indicates the object type.

# coding: utf-8
a = u'ç'
b = 'ç'

print a
print b
print repr(a)
print repr(b)
print type(a)
print type(b)
print a==b                  # Not comparing same types.
print a==b.decode('utf8')   # Comparing both as Unicode strings.
print a.encode('utf8')==b   # Comparing both as byte strings.

a and b print the same, but are not the same:

ç
ç
u'\xe7'
'\xc3\xa7'
<type 'unicode'>
<type 'str'>
C:\Users\metolone\Desktop\Script1.py:11: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  print a==b
False
True
True

Your letter1 and letter2 are two different types of strings.
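One way to apply this to two variables of unknown type is to decode any byte string to Unicode before comparing. The following is a minimal sketch, not the asker's code: the helper name to_text is hypothetical, and it assumes the bot's byte strings are UTF-8 encoded (substitute the real encoding if it differs).

```python
from __future__ import print_function


def to_text(value, encoding='utf-8'):
    """Return value as a Unicode string (hypothetical helper).

    Byte strings are decoded with `encoding`, which is assumed to be UTF-8.
    Values that are already Unicode are returned unchanged.
    """
    if isinstance(value, bytes):   # `bytes` is an alias of `str` on Python 2.6+
        return value.decode(encoding)
    return value


letter1 = b'\xd9\x8a'   # UTF-8 bytes of the Arabic letter yeh, as a bot might receive them
letter2 = u'\u064a'     # the same letter as a Unicode string

# Both sides are Unicode now, so the comparison is meaningful.
print(to_text(letter1) == to_text(letter2))
```

Decoding once, at the point where the data enters the program, keeps every later comparison between objects of the same type.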

Here's a complete example reading a word list from a file and taking input from a user:

import sys
import codecs

# The word list was saved in UTF-8 encoding.  It can be in any encoding
# as long as the correct one is specified when reading it in.
# `codecs.open` will convert the input to Unicode.
with codecs.open('words.txt','r',encoding='utf8') as f:
    word_list = f.read().strip().splitlines()

print 'word_list and type:',word_list,type(word_list[0])

# Different consoles can have different input encodings.  Let's see what it is.
print 'My terminal encoding:',sys.stdin.encoding

# Read a word in the input encoding.  We'll convert to Unicode later.
word = raw_input('Word? ')
print 'word, content and type:',word,repr(word),type(word)

# Now decode the input to Unicode.
word = word.decode(sys.stdin.encoding)
print 'converted word, content and type:',word,repr(word),type(word)

# Compare the two Unicode strings
print 'Comparison:',word in word_list

Output from a US Windows console. Note that different consoles have different encodings. Linux is usually UTF-8. Non-US Windows consoles can be different.

word_list and type: [u'\ufeffadi\xf3s', u'ping\xfcino'] <type 'unicode'>
My terminal encoding: cp437
Word? pingüino
word, content and type: pingüino 'ping\x81ino' <type 'str'>
converted word, content and type: pingüino u'ping\xfcino' <type 'unicode'>
Comparison: True
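Note that the first entry in the word list above begins with u'\ufeff', a byte order mark (BOM) that the editor wrote at the start of the file. With that character attached, typing the first word would not match. A small sketch of one way around this, assuming the file is UTF-8 with an optional BOM: read it with the 'utf-8-sig' codec, which drops a leading BOM if present. The temporary file and its contents are made up for the demonstration; io.open is used here, and codecs.open accepts the same encoding name.

```python
from __future__ import print_function
import codecs
import io
import os
import shutil
import tempfile

# Build a throwaway word list that starts with a UTF-8 BOM (illustrative data).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'words.txt')
with io.open(path, 'wb') as f:
    f.write(codecs.BOM_UTF8 + u'adi\xf3s\nping\xfcino\n'.encode('utf-8'))

# Plain 'utf-8' keeps the BOM as the first character of the first word.
with io.open(path, 'r', encoding='utf-8') as f:
    with_bom = f.read().strip().splitlines()

# 'utf-8-sig' removes a leading BOM when one is present.
with io.open(path, 'r', encoding='utf-8-sig') as f:
    without_bom = f.read().strip().splitlines()

shutil.rmtree(tmp_dir)  # clean up the temporary directory

print(u'adi\xf3s' in with_bom)     # the BOM makes the first word differ
print(u'adi\xf3s' in without_bom)  # matches once the BOM is stripped
```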

Upvotes: 0

Steve Barnes

Reputation: 28370

If you need to compare single letters you could always compare the actual value using ord(a)==ord(b).

In answer to the example posted:

>>> def check(b):
...    a = u'ي'
...    return (b==a, ord(a), ord(b), ord(a)==ord(b))
... 
>>> check(u'ي')
(True, 1610, 1610, True)
>>>

You do need to be consistent in marking unicode as unicode, i.e. putting the u before the quotes.
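A caveat worth illustrating: ord only accepts a single character, so it fails on an undecoded byte string that holds a multi-byte letter. The sketch below assumes the bytes are UTF-8, which is consistent with the 0xd9 byte in the question's error message but is not confirmed by the asker.

```python
from __future__ import print_function

raw = b'\xd9\x8a'   # the letter from the example above, as UTF-8 bytes (2 bytes long)

try:
    ord(raw)        # ord() needs exactly one character, so this raises TypeError
except TypeError as exc:
    print('ord() on the raw bytes failed:', exc)

decoded = raw.decode('utf-8')            # now a single Unicode character
print(ord(decoded) == ord(u'\u064a'))    # both sides are the code point 1610
```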

Upvotes: -1

Pablo

Reputation: 1319

Look, the letter ç (a character that is not present in ASCII) may be represented as a str object or as a unicode object (maybe you are a little confused about what unicode means).

Also, if you are trying to create a unicode object from a byte string containing characters that are not in the ASCII table, you must specify the encoding the bytes are in:

unicode('ç')

This will raise a UnicodeDecodeError because 'ç' is not in ASCII, but

unicode('ç', encoding='utf-8')

will work, because the byte string 'ç' is encoded as UTF-8 here (as your Arabic letters may be).

You can compare unicode objects with unicode objects in the same way you can compare str objects with str objects, and all of this works fine.

Also, you can compare a str object with a unicode object, but this is error-prone if you are comparing non-ASCII characters: 'ç' as a UTF-8 str is '\xc3\xa7' but as unicode it is just u'\xe7' (so the comparison returns False).

So @Karsa may really be right. The problem is with your 'variables' (in Python, a better word is objects). You must make sure that you are comparing only str objects or only unicode objects.

So, better code could be:

#-*- coding: utf-8 -*-

def compare_first_letter(phrase, compare_letter):
    # making all unicode objects, with utf-8 codec
    compare_letter = unicode(compare_letter,encoding='utf-8')
    phrase = unicode(phrase,encoding='utf-8')
    # taking the first letters of each word in phrase
    first_letters = [word[0] for word in phrase.split()]
    # comparing the  first letters with the letter you want
    for letter in first_letters:
        if letter != compare_letter:
            return False
    return True # or your reply function

letter = 'ç'
phrase_1 = "one two three four"
phrase_2 = "çarinha çapoca çamuca"

print(compare_first_letter(phrase_1,letter))
print(compare_first_letter(phrase_2,letter))
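One limitation of the function above: in Python 2, unicode(x, encoding='utf-8') raises a TypeError when x is already a unicode object, so it only works if both arguments arrive as byte strings. A hedged variant that accepts either type is sketched below. The names as_unicode and all_words_start_with are hypothetical, UTF-8 is an assumed encoding, and unlike the version above it returns False for an empty phrase, which is a design choice rather than a requirement.

```python
from __future__ import print_function


def as_unicode(value, encoding='utf-8'):
    # Decode byte strings; leave Unicode strings untouched (hypothetical helper).
    if isinstance(value, bytes):   # `bytes` is `str` on Python 2.6+
        return value.decode(encoding)
    return value


def all_words_start_with(phrase, letter, encoding='utf-8'):
    """Return True if every word in phrase starts with letter."""
    phrase = as_unicode(phrase, encoding)
    letter = as_unicode(letter, encoding)
    words = phrase.split()
    # An empty phrase has no words, so treat it as "no match".
    return bool(words) and all(word.startswith(letter) for word in words)


# Mixed input types: a UTF-8 byte string phrase and a Unicode letter.
print(all_words_start_with(b'\xc3\xa7arinha \xc3\xa7apoca', u'\xe7'))
print(all_words_start_with(u'one two three four', b'\xc3\xa7'))
```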

Upvotes: 3

Kasravnd

Reputation: 107287

This is my attempt, based on what you described:

>>> b=u'letter'
>>> a=u'letter'
>>> a==b
True
>>> a=u'letter2'
>>> a==b
False

So I'm sure there is a problem with your variables! I suggest printing them before you compare them, to see what they actually contain.

Upvotes: 0
