Reputation: 790
Using Python 2.7.9 on Windows 8.1 Enterprise 64-bit
I'm using the following code to search for any Korean characters ( http://lcweb2.loc.gov/diglib/codetables/9.3.html )
import re
line = ['x', 'y', 'z', '쭌', 'a']
if any([re.search("[%s-%s]" % ("\xE3\x84\xB1".decode('utf-8'), "\xEC\xAD\x8C".decode('utf-8')), x) for x in line[3:]]):
    print "found character"
Whenever I run the script and give it the character 쭌, the console shows 쭌, which I'm guessing is a result of IDLE / Command Prompt being unable to show Korean characters. 쭌 is the last character that I was hoping to match in the regex.
So is the above search correct, at least? I'd prefer to know that I have the right pattern to search for before I spend time trying to make the console show the proper Korean characters.
I've tried running chcp 1252 in Command Prompt, and nothing changed. The script never prints "found character", so I can't tell whether the pattern matches.
If it helps, the script is receiving text from an IRC channel where Korean is usually spoken.
Upvotes: 2
Views: 5843
Reputation: 8187
If you wanted to use the regex library (not to be confused with re), you could do this:
import regex
regex.search(r'\p{IsHangul}', '오소리')
or in a function to detect at least one Hangul character:
import regex
def is_hangul(value):
    if regex.search(r'\p{IsHangul}', value):
        return True
    return False
print(is_hangul('오소리')) # True
print(is_hangul('mushroom')) # False
print(is_hangul('뱀')) # True
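If installing a third-party package is not an option, a similar check can be sketched with the standard library alone. This is only a rough alternative, not what the regex module does internally: the helper name has_hangul is made up for illustration, it assumes the input is already a Unicode string, and it relies on the Unicode character names of Hangul syllables and jamo starting with "HANGUL" (halfwidth forms, whose names start with "HALFWIDTH HANGUL", would be missed).

```python
import unicodedata


def has_hangul(text):
    """Return True if any character's Unicode name starts with 'HANGUL'.

    Hypothetical helper; expects a Unicode string (u'...' on Python 2).
    """
    for ch in text:
        # The second argument is returned for characters without a name,
        # which avoids a ValueError on unnamed code points.
        if unicodedata.name(ch, '').startswith('HANGUL'):
            return True
    return False
```

For example, has_hangul(u'\uc624\uc18c\ub9ac') (the string 오소리) should be true, while has_hangul(u'mushroom') should be false.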
Upvotes: 5
Reputation: 8992
Use Unicode strings (note the "u" prefixes):
import re
line = [u'x', u'y', u'z', u'쭌', u'a']
if any([re.search(u'[\u3131-\ucb4c]', x) for x in line[3:]]):
    print "found character"
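One caveat about the range above: everything between U+3131 and U+CB4C is matched, which includes the CJK ideographs, and the Hangul Syllables block continues past 쭌 up to U+D7A3. A narrower pattern might look like the sketch below. The three ranges (Hangul Jamo, Hangul Compatibility Jamo, Hangul Syllables) are my reading of the Unicode blocks, and HANGUL_RE, contains_hangul and the UTF-8 default are illustrative assumptions, not something from the question. Since the text arrives from IRC as bytes, the helper decodes first.

```python
import re

# Assumed ranges: Hangul Jamo, Hangul Compatibility Jamo, Hangul Syllables.
HANGUL_RE = re.compile(u'[\u1100-\u11ff\u3130-\u318f\uac00-\ud7a3]')


def contains_hangul(text, encoding='utf-8'):
    """Return True if text contains at least one Hangul character.

    Hypothetical helper; the encoding default is an assumption about
    what the IRC server sends.
    """
    # Raw network data is a byte string (str on Python 2, bytes on
    # Python 3), so decode it before matching a Unicode pattern.
    if isinstance(text, bytes):
        text = text.decode(encoding, 'replace')
    return HANGUL_RE.search(text) is not None
```

With this, contains_hangul(b'\xec\xad\x8c') (the UTF-8 bytes of 쭌) should match, while a Chinese character such as u'\u4e2d' should not.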
Upvotes: 2