Reputation: 23
I am trying to take a file and remove all characters that are not in the Greek language. We found the Unicode values for the alphabet, 880 - 1023, and were able to print out the correct characters with a simple print(unichr(880)) line. The problem is when running this code:
greek ='ÏÎ'
for c in greek:
    if(unichr(c) >= 880 and unichr(c) <= 1023):
        print(c)
Is there a way to enter any letter or symbol and get back its Unicode value? We have tested with values inside and outside of the Greek range and still get the same error: UnicodeDecodeError: 'ascii' codec cannot decode byte 0xc3 in position 0: ordinal not in range(128)
Upvotes: 2
Views: 832
Reputation: 2691
You must make sure your editor saves your program in UTF-8 encoding. How to do that depends on the text editor you use. If you use IDLE, it will suggest adding the coding line when you save.
Also, there is no need to enclose conditions in parentheses, and you can use a short chained comparison.
This is for Python 3:
# -*- coding: utf-8 -*-
greek ='ÏÎ'
for c in greek:
    if 880 <= ord(c) <= 1023:
        print(c)
(On my screen, the two characters appear as capital I with tilde accent and capital I with circumflex accent; replace them with the appropriate characters.)
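For the original goal of dropping everything outside the Greek range, the same chained comparison can be wrapped in a small helper. This is only a minimal sketch for Python 3: `keep_greek` and the two constant names are made up for illustration, and the bounds 880 and 1023 are simply the ones given in the question.

```python
# Bounds taken from the question: code points 880..1023 (U+0370..U+03FF).
GREEK_FIRST = 880
GREEK_LAST = 1023


def keep_greek(text):
    """Return only the characters of text whose code point is in the range."""
    return "".join(c for c in text if GREEK_FIRST <= ord(c) <= GREEK_LAST)


# Latin letters, spaces, and two characters from the Greek block.
sample = "abc \u03aa\u0390 xyz"

# Print code points rather than the characters, so the output does not
# depend on the console encoding.
print([ord(c) for c in keep_greek(sample)])  # [938, 912]
```

The two characters from the question (code points 207 and 206) fall outside the range, so `keep_greek` would return an empty string for them.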
Upvotes: 0
Reputation: 77367
You have several problems. Assuming this is Python 2 (since there is no unichr in Python 3, you'd get a different error there), your first problem is that you didn't initialize a unicode string in the first place.
>>> greek ='ÏÎ'
>>> len(greek)
4
These aren't 2 unicode characters... they are 4 single byte characters that also happen to be the utf-8 encodings of the unicode characters. Instead do
greek =u'ÏÎ'
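The byte-versus-character distinction is easier to see in Python 3, where text and bytes are separate types. A small sketch, assuming the two characters in the question are U+00CF and U+00CE as they display here; the variable names are illustrative only:

```python
# The two characters from the question, written as escapes so the example
# does not depend on how this file is saved.
text = "\u00cf\u00ce"

# Encoding to UTF-8 gives the raw bytes that a Python 2 plain (non-u'')
# string literal would hold when the source file is saved as UTF-8.
raw = text.encode("utf-8")

print(len(text))    # 2 characters
print(len(raw))     # 4 bytes
print(hex(raw[0]))  # 0xc3, the byte named in the error message
```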
Next, these are not the droids, I mean greek characters, you think they are.
>>> ord(greek[0])
207
These are codepage characters in the 128-255 range and are outside of the range you are looking for. Did you want these instead?
>>> greek = u'Ϊΐ'
>>> ord(greek[0])
938
Finally, unichr goes the wrong way... it converts ordinals to characters but you wanted to go the other way. So,
>>> for c in greek:
...     if ord(c) >= 880 and ord(c) <= 1023:
...         print(c)
...
Ϊ
ΐ
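Since the question starts from a file, the same ord check could be applied while reading with an explicit encoding, which sidesteps the byte-string problem altogether. The following is a sketch under assumptions, not a tested recipe for the asker's data: `filter_greek_file` and its parameters are hypothetical names, and the file is assumed to be UTF-8. `io.open` is used because it returns unicode text in both Python 2 and Python 3.

```python
import io


def filter_greek_file(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path, keeping only code points 880..1023."""
    with io.open(src_path, "r", encoding=encoding) as src:
        with io.open(dst_path, "w", encoding=encoding) as dst:
            for line in src:
                # Each c is a one-character unicode string, so ord(c)
                # is its code point.
                kept = u"".join(c for c in line if 880 <= ord(c) <= 1023)
                dst.write(kept)
```

Note that newlines and spaces are outside the range too, so this version removes them as well; widen the condition if line breaks should survive.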
Upvotes: 1
Reputation: 93
unichr accepts an integer and returns a unicode character. My first suggestion is that you replace unichr with ord here, since you're passing in a string and want to get back an integer that represents a unicode code point.
Now for the UnicodeDecodeError: I suspect that it's occurring because print is implicitly trying to encode your unicode string, but doesn't know how to do so. You might have more luck if you explicitly encode your unicode string with utf-8. Try this:
greek = u'ÏÎ'
for c in greek:
    if(ord(c) >= 880 and ord(c) <= 1023):
        print(c.encode('utf-8'))
Upvotes: 2
Reputation: 21453
You definitely want to use ord; it is like the inverse function of chr or unichr:
>>> x = unichr(1000)
>>> ord(x)
1000
>>> y = unichr(880)
>>> y
u'\u0370'
>>> ord(y)
880
>>> help(ord)
Help on built-in function ord in module __builtin__:
ord(...)
    ord(c) -> integer
    Return the integer ordinal of a one-character string.
So you pass it a unicode character and it gives you the ordinal of that character.
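The session above is Python 2. In Python 3 there is no unichr; the built-in chr covers the full Unicode range and plays the same role. A short round-trip sketch for Python 3, with the sample code points chosen arbitrarily from the range in the question:

```python
# Round trip: integer -> character -> integer.
for code_point in (880, 938, 1000, 1023):
    ch = chr(code_point)          # Python 3 spelling of unichr
    assert ord(ch) == code_point  # ord undoes chr
    # ascii() shows the character as an escape, so the output does not
    # depend on the console encoding.
    print(code_point, hex(code_point), ascii(ch))
```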
Upvotes: 1