Greg Tzikas

Reputation: 23

How to input unicode character and get its numeric value

I am trying to take a file and remove all characters that are not in the Greek alphabet. We found the Unicode code point range for the alphabet, 880 to 1023, and were able to print the correct characters with a simple print(unichr(880)) line. The problem appears when we run this code:

greek ='ÏÎ' 
for c in greek:
    if(unichr(c) >= 880 and unichr(c) <= 1023):
        print(c)

Is there a way to enter any letter or symbol and get back its Unicode value? We have tested with values both inside and outside the Greek range and always get the same error: UnicodeDecodeError: 'ascii' codec cannot decode byte 0xc3 in position 0: ordinal not in range(128)

Upvotes: 2

Views: 832

Answers (4)

Sci Prog

Reputation: 2691

You must make sure your editor saves your program in UTF-8 encoding. How to do this depends on the text editor you use. If you use IDLE, it will offer to add the coding line when you save.

Also, there is no need to enclose conditions in parentheses, and you can use a chained comparison such as 880 <= ord(c) <= 1023.

This is for Python 3:

# -*- coding: utf-8 -*-
greek = 'ÏÎ'
for c in greek:
    if 880 <= ord(c) <= 1023:
        print(c)

(On my screen, the two characters appear as capital I with tilde accent and capital I with circumflex accent; replace them with the appropriate characters.)
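The range check above can be turned into a small filter for the original goal of dropping every non-Greek character. This is only a sketch for Python 3: the helper name keep_greek and the sample string are made up for illustration, and the range 880 to 1023 is the one given in the question (it does not cover the separate Greek Extended block at U+1F00 to U+1FFF).

```python
# Hypothetical helper; the name keep_greek is not from the original post.
GREEK_FIRST = 0x0370  # 880, lower bound of the range given in the question
GREEK_LAST = 0x03FF   # 1023, upper bound of the range given in the question


def keep_greek(text):
    """Return only the characters of text whose code point is in 880..1023."""
    return "".join(c for c in text if GREEK_FIRST <= ord(c) <= GREEK_LAST)


# Escapes are used so the example does not depend on the file's encoding.
sample = "abc \u03b1\u03b2\u03b3 123 \u03aa\u0390"
filtered = keep_greek(sample)

# Print code points rather than glyphs, so the output is plain ASCII.
print([ord(c) for c in filtered])
```

Spaces, digits, and Latin letters all fall outside the range, so they are removed along with everything else that is not in 880 to 1023.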

Upvotes: 0

tdelaney

Reputation: 77367

You have several problems. Assuming this is Python 2 (there is no unichr in Python 3, so you would get a different error there), your first problem is that you didn't create a unicode string in the first place.

>>> greek ='ÏÎ' 
>>> len(greek)
4

These aren't 2 unicode characters; they are 4 single-byte characters that happen to be the UTF-8 encoding of the 2 unicode characters. Instead, use a unicode literal:

greek =u'ÏÎ'
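The same byte-versus-character distinction can be reproduced in Python 3, where text and bytes are separate types. A minimal sketch, assuming the source file was saved as UTF-8 (which the 0xc3 in the error message suggests):

```python
# The two characters from the question, written as escapes.
text = "\u00cf\u00ce"

# What a Python 2 plain '...' literal would hold if the file is UTF-8.
raw = text.encode("utf-8")

print(len(text))    # number of code points in the text string
print(len(raw))     # number of bytes in the encoded form
print(hex(raw[0]))  # first byte of the encoded form
```

Decoding raw with the right codec (raw.decode("utf-8")) gives back the two-character string, which is what the u'' prefix arranges for you in Python 2.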

Next, these are not the droids, I mean Greek characters, you think they are.

>>> ord(greek[0])
207

These are codepage characters in the 128-255 range and are outside of the range you are looking for. Did you want these instead?

>>> greek = u'Ϊΐ'
>>> ord(greek[0])
938

Finally, unichr goes the wrong way: it converts ordinals to characters, but you want to go the other way, from characters to ordinals. So use ord:

>>> for c in greek:
...     if ord(c) >= 880 and ord(c) <= 1023:
...         print(c)
... 
Ϊ
ΐ

Upvotes: 1

zachwf

Reputation: 93

unichr accepts an integer and returns a unicode character. My first suggestion is that you replace unichr with ord here, since you're passing in a string and want to get back an integer that represents a unicode code point.

Now for the Unicode decode error: I suspect it occurs because print implicitly tries to encode your unicode string but doesn't know how to do so. You might have more luck if you explicitly encode your unicode string as UTF-8. Try this:

greek = u'ÏÎ' 
for c in greek:
    if(ord(c) >= 880 and ord(c) <= 1023):
        print(c.encode('utf-8'))
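Since the question is about filtering a file, another way to be explicit about encoding is to name it when opening the file, so no implicit ASCII conversion happens. The following is a sketch for Python 3 (io.open also exists in Python 2); the function name filter_file, the file names, and the assumption that the input is UTF-8 are all illustrative and not from the original post.

```python
import io
import os
import tempfile


def filter_file(src_path, dst_path):
    """Copy only characters in 880..1023 from src_path to dst_path.

    Assumes both files are UTF-8; that is a guess about the real data.
    """
    with io.open(src_path, encoding="utf-8") as src:
        text = src.read()
    greek_only = u"".join(c for c in text if 880 <= ord(c) <= 1023)
    with io.open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(greek_only)
    return greek_only


# Small demonstration using a temporary directory.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "input.txt")
dst_path = os.path.join(tmp_dir, "output.txt")
with io.open(src_path, "w", encoding="utf-8") as f:
    f.write(u"abc \u03b1\u03b2 xyz")

result = filter_file(src_path, dst_path)
print([ord(c) for c in result])
```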

Upvotes: 2

Tadhg McDonald-Jensen

Reputation: 21453

You definitely want to use ord; it is the inverse of chr and unichr:

>>> x = unichr(1000)
>>> ord(x)
1000
>>> y = unichr(880)
>>> y
u'\u0370'
>>> ord(y)
880
>>> help(ord)
Help on built-in function ord in module __builtin__:

ord(...)
    ord(c) -> integer

    Return the integer ordinal of a one-character string.

So you pass it a single unicode character and it gives you the ordinal (code point) of that character.
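For comparison, here is the same round trip as a Python 3 sketch, where chr covers the full Unicode range and unichr no longer exists. The chosen code points are just examples, and the "<unassigned>" fallback is there because not every value between 880 and 1023 is an assigned character.

```python
import unicodedata

for code_point in (880, 888, 938, 1000, 1023):
    ch = chr(code_point)          # integer -> one-character string
    assert ord(ch) == code_point  # one-character string -> integer
    # unicodedata.name raises ValueError for unnamed code points
    # unless a default is supplied as the second argument.
    name = unicodedata.name(ch, "<unassigned>")
    print(code_point, hex(code_point), name)

iota_name = unicodedata.name(chr(938))
```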

Upvotes: 1
